A New Hybrid Method Based on Improved Particle Swarm Optimization, Ant Colony Algorithm and HMM for Web Information Extraction


A New Hybrid Method Based on Improved Particle Swarm Optimization, Ant Colony Algorithm and HMM for Web Information Extraction

Rong LI, Hong-bin WANG
Department of Computer, Xinzhou Teachers University, Xinzhou, China

Abstract - In order to further enhance the accuracy of Web information extraction, and to overcome the shortcomings of the Hidden Markov Model (HMM) and its hybrid methods in parameter optimization, a novel Web extraction algorithm based on a combined and improved particle swarm optimization and ant colony algorithm (IPSO-ACA) together with HMM is presented. First, an HMM for information extraction is built. Second, an improved hybrid intelligent algorithm combining PSO with ACA is proposed. In the new algorithm, the inertia weight of particle swarm optimization and the parameters of the ant colony algorithm, such as the stimulating factors, the volatilization coefficient and the pheromones, are all adjusted adaptively, and the fitness values of the particles' historical optimal solutions are used to set the initial pheromone distribution of the ant colony algorithm. Third, the hybrid intelligent algorithm is used to find an approximate global optimal solution, and the Baum-Welch (BW) algorithm is then applied for local refinement, which not only removes BW's dependency on initial values and its tendency to become trapped in local optima, but also makes full use of the global search ability of the hybrid intelligent algorithm and the local exploitation ability of BW. Finally, the Viterbi algorithm is used to decode the HMM. Compared with existing HMM optimization methods, the comprehensive Fβ=1 value is increased by 7.3% on average, which shows that the improved algorithm can effectively enhance optimization performance and extraction accuracy.

Keywords - information extraction, Hidden Markov Model, particle swarm optimization, ant colony algorithm

I. INTRODUCTION

With the development of Internet technology, Web resources have shown a trend toward massive amounts of unstructured information.
How to recognize the data of interest to users in unstructured or semi-structured Web pages, and turn it into a more structured format with clearer semantics, is the central technical problem of Web information extraction [1]. A large number of experts and scholars have applied statistical machine learning methods to this field. Typical statistical methods mainly include the Hidden Markov Model (HMM) [2-3] and its hybrid methods [4-8]. For instance, Zhang et al. [4] combined a binary HMM with SVM to realize metadata extraction. Lin et al. [5] proposed a text information extraction method based on maximum entropy and HMM, which used a weighted sum of observed text features to adjust the HMM transition probabilities. Xiao et al. [6] employed a genetic algorithm (GA) to optimize HMM parameters and obtained extraction results superior to the traditional HMM, but the approach still suffered from the premature convergence of GA. Zou et al. [7] proposed Web information extraction based on simulated annealing (SA) and HMM, but the method did not consider HMM context features. Wang et al. [8] presented a Web extraction algorithm using improved PSO and HMM, whose improvements lay in the inertia weight and in mutating part of the particles. Analysis of these HMM studies shows that there is still considerable room for improvement in parameter optimization and extraction performance. Inspired by this, this paper proposes a self-adaptive hybrid intelligent optimization HMM algorithm for Web citation extraction. After constructing an HMM, a self-adaptive hybrid intelligent optimization algorithm based on improved PSO and ACA (IPSAA for short) is put forward. The new algorithm realizes dynamic self-adaptive adjustment of parameters such as the inertia weight of PSO and the stimulating factors, volatilization coefficient and pheromones of ACA; the fitness values of the particles' historical optimal solutions are then used to set the initial pheromone distribution of the ant colony algorithm, which is the junction point between PSO and ACA.
This paper then maps the approximate global optimal solution found by IPSAA into the initial model of the BW algorithm and adopts BW to further refine the parameters locally. Finally, the improved model uses the Viterbi algorithm to decode the optimal state sequences. Experimental results indicate that the IPSAA-HMM algorithm greatly improves the accuracy of Web information extraction, which proves its feasibility and effectiveness.

II. WEB INFORMATION EXTRACTION BASED ON HMM

An HMM may be viewed as a five-tuple (S, O, Π, A, B), where:
1) S is a state set containing N states, denoted as S = {S_1, S_2, ..., S_N};
2) O is a symbol set including M output symbols, denoted as O = {O_1, O_2, ..., O_M};
3) Π is the initial state probability vector, Π = {π_i}, π_i = P(q_1 = S_i), 1 ≤ i ≤ N, 0 ≤ π_i ≤ 1, Σ_{i=1}^{N} π_i = 1;
4) A is the state transition probability matrix;
5) B is the symbol output probability matrix, B = {b_j(o_k)}, b_j(o_k) = P(o_t = v_k | q_t = s_j), 0 ≤ b_j(o_k) ≤ 1, Σ_{k=1}^{M} b_j(o_k) = 1, 1 ≤ j ≤ N, 1 ≤ k ≤ M.

DOI 10.5013/IJSSST.a.17.45.39 ISSN: 1473-804x online, 1473-8031 print
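As a minimal illustration of these constraints, the following Python sketch builds a randomly initialized λ = (Π, A, B) whose rows satisfy the sum-to-one conditions. The sizes N = 4 states and M = 6 symbols are illustrative only, not values from the paper.

```python
import random

def normalize_rows(m):
    """Scale each row so it sums to one (a valid probability distribution)."""
    return [[v / sum(row) for v in row] for row in m]

def random_hmm(N, M):
    """Randomly initialized HMM parameters lambda = (pi, A, B).

    pi: length-N initial state distribution
    A:  N x N state transition matrix
    B:  N x M symbol emission matrix
    """
    pi = normalize_rows([[random.random() for _ in range(N)]])[0]
    A = normalize_rows([[random.random() for _ in range(N)] for _ in range(N)])
    B = normalize_rows([[random.random() for _ in range(M)] for _ in range(N)])
    return pi, A, B

pi, A, B = random_hmm(4, 6)
```

The same row normalization is what Section III applies to particles that drift outside the probability simplex during optimization.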

When an HMM is applied to information extraction, the observation layer of the model is the text sequence to be observed, and the hidden layer is the state sequence composed of state domains such as <Author>, <Title> and <Journal>. The extraction process may be described as follows: given the HMM λ = (Π, A, B) and the observed text sequence O = (O_1, O_2, ..., O_T), initialize the HMM parameters randomly and use the BW algorithm for training so as to build the HMM; finally, adopt the Viterbi algorithm to find the state domain sequence q* = (q_1, q_2, ..., q_T) with the maximum probability P(q|O).

III. HMM TRAINING ALGORITHM BASED ON IPSAA

A. Particle Swarm Algorithm and Basic Ant Colony Algorithm

Particle swarm optimization (PSO) is a global optimization algorithm simulating the movement behavior of bird swarms [9-11]. Each candidate solution of a problem may be seen as a particle in the search space. Particles update their velocity and position by tracking the individual optimal solution x_pbest and the global optimal solution x_gbest; the updating equations are as follows:

v_i(t) = ω v_i(t-1) + c_1 r_1(t) (x_pbest,i - x_i(t-1)) + c_2 r_2(t) (x_gbest - x_i(t-1))   (1)
x_i(t) = x_i(t-1) + v_i(t)   (2)

where x_i(t) and v_i(t) represent the position and velocity of the i-th particle in the t-th generation, ω is a non-negative inertia weight, c_1 and c_2 are non-negative learning factors, and r_1, r_2 are random numbers in [0,1].

The ant colony algorithm is a bionic evolutionary heuristic algorithm proposed by Dorigo. Through the pheromone-induced effect, individual ants lead later ants to choose the shorter paths carrying stronger pheromone, and the algorithm gradually converges to the global optimal solution. First, m ants are randomly placed on n nodes; the probability with which the k-th ant at node i selects the next node j is as follows:

p_ij^k(t) = [τ_ij(t)]^α [η_ij(t)]^β / Σ_{l ∈ allowed_k} [τ_il(t)]^α [η_il(t)]^β, if j ∈ allowed_k; 0, otherwise   (3)

In Eq. (3), τ_ij(t) represents the amount of pheromone between nodes i and j at time t, and η_ij(t) denotes the heuristic function.
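Equations (1)-(2) can be sketched in a few lines of pure Python. The sphere function below is only a stand-in objective for demonstration, and all parameter values (20 particles, 100 iterations, ω = 0.4) are illustrative, not the paper's settings.

```python
import random

def pso_step(xs, vs, pbest, gbest, w, c1=2.0, c2=2.0):
    """One synchronous application of Eq. (1)-(2) to every particle."""
    for i in range(len(xs)):
        for d in range(len(xs[i])):
            r1, r2 = random.random(), random.random()
            vs[i][d] = (w * vs[i][d]
                        + c1 * r1 * (pbest[i][d] - xs[i][d])
                        + c2 * r2 * (gbest[d] - xs[i][d]))
            xs[i][d] += vs[i][d]
    return xs, vs

def sphere(x):  # stand-in fitness, minimized toward 0
    return sum(v * v for v in x)

random.seed(0)
n, d = 20, 6
xs = [[random.uniform(-5.12, 5.12) for _ in range(d)] for _ in range(n)]
vs = [[0.0] * d for _ in range(n)]
pbest = [x[:] for x in xs]
gbest = min(pbest, key=sphere)[:]
init_best = sphere(gbest)
for t in range(100):
    xs, vs = pso_step(xs, vs, pbest, gbest, w=0.4)
    for i, x in enumerate(xs):
        if sphere(x) < sphere(pbest[i]):
            pbest[i] = x[:]
    gbest = min(pbest, key=sphere)[:]
```

Since pbest entries are only ever replaced by strictly better positions, the global best can only improve over the run.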
The parameters α and β, the pheromone heuristic factor and the expectation heuristic factor, determine the relative importance of the pheromone amount and the inter-node distance. allowed_k denotes the set of nodes that ant k is allowed to select next. As time goes on, previously deposited pheromone gradually evaporates. After n moments the ants complete one cycle, and the amount of pheromone on each path is adjusted according to Eq. (4):

τ_ij(t+n) = (1-ρ) τ_ij(t) + Δτ_ij, ρ ∈ (0,1)
Δτ_ij = Σ_{k=1}^{m} Δτ_ij^k
Δτ_ij^k = Q / l_k, if edge (i,j) ∈ L_k; 0, otherwise   (4)

where Δτ_ij^k denotes the pheromone left by the k-th ant in this iteration, Δτ_ij denotes the pheromone increment between nodes i and j in the cycle, ρ is the pheromone evaporation coefficient, and Q is a constant. L_k and l_k respectively denote the k-th ant's traveled path and its length in this iteration.

B. Improvement Strategy of PSO

In this article, the improvement of PSO is mainly aimed at adaptive inertia weight adjustment. The value of the inertia weight ω has an important influence on the PSO search. Generally, to obtain better performance, ω should take a larger value in the early search stage to ensure strong global search ability over a large search space and to avoid premature convergence; as iterations increase, ω should take a smaller value to ensure strong local search ability within a smaller search space and to enhance convergence precision. Appropriate control of the inertia weight during the iterative process can therefore balance global and local search, yielding good solutions on average in fewer iterations. The traditional self-adaptive method makes ω decrease linearly as iterations increase, which improves performance, but shortcomings remain. On the one hand, such a PSO cannot effectively reflect the complicated non-linear behavior of the swarm's actual search process, so the convergence speed and precision are still not ideal.
On the other hand, the slope of the linear decrease of ω is still problem-dependent; there is no universally optimal slope for all optimization problems. From the preceding analysis, the change of ω is dynamic and non-linear, so this paper adopts a non-linear function to describe the dynamic change rule of ω during iteration. The ω value at each iterative step is determined by the following exponential formula:

ω(iter) = ω_init · exp(-(iter_now / iter_max)^n)   (5)
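Reading Eq. (5) as ω_init·exp(-(iter_now/iter_max)^n), the schedule can be sketched directly; the default values ω_init = 0.4 and n = 1.25 are taken from the ranges the paper later validates.

```python
import math

def inertia_weight(iter_now, iter_max, w_init=0.4, n=1.25):
    """Eq. (5): non-linear, exponentially decaying inertia weight.

    Starts at w_init when iter_now = 0 and decays monotonically to
    w_init / e when iter_now = iter_max; larger n keeps the weight
    high (global search) for longer.
    """
    return w_init * math.exp(-((iter_now / iter_max) ** n))
```

A larger n flattens the early part of the curve, which matches the paper's observation that larger n prolongs the global search phase.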

In Eq. (5), n is a control power exponent of the non-linear change rule; in particular, when n = 2, Eq. (5) is often referred to as a probability curve function. Figure 1 shows the ω iteration curves for different values of n (n = 0.75, 1, 1.25, 1.5, 1.75 and 2, plotted against iter_now / iter_max).

Figure 1. ω Iteration Change Curve with Different n Values.

As shown in Fig. 1, for a given initial value ω_init and control power exponent n, the non-linear change rule of ω over the iterations is uniquely determined. The greater the n value, the longer the global search phase of the particle swarm lasts, while the smaller the n value, the longer the local search phase lasts. Using the d-dimensional sphere function f(x) = Σ_{i=1}^{d} x_i^2 (d = 6, x_i ∈ [-5.12, 5.12]) to validate the parameter values, the results show that when ω_init is set in [0.2, 0.5] and n is set in [0.5, 2], the algorithm performs excellently.

C. Improvement Strategy of ACA

Seeking a balance between "exploration" and "exploitation" is one of the key issues in the study of ant colony algorithms [12]. In order to balance finding new paths against using prior knowledge, two aspects should be considered when improving ACA. On the one hand, the search space of ACA should be made as large as possible, so as to cover the region containing the possible optimal solution. On the other hand, the currently effective information within the ant colony should be fully exploited, so that the search emphasis focuses on the individual intervals with higher fitness values; the algorithm may then converge to the global optimal solution with greater probability. The convergence speed of the ant colony algorithm should be improved as far as possible under the premise of finding the global optimal solution. In this paper, an adaptive strategy is adopted to resolve this main contradiction between search performance and convergence rate.
1) Adaptive adjustment of the pheromone heuristic factor α and the expectation heuristic factor β. When α = 0, only the heuristic information works and the algorithm is equivalent to a greedy shortest-path search; when β = 0, only the path pheromone works and the algorithm degenerates into a blind random search driven purely by positive feedback. At first the ants know nothing about the links, and the pheromone on a link has little effect on wayfinding. As the iteration count NC increases, the pheromone on the links becomes more and more important to the wayfinding ants; in the end, the probability of selecting the winning link grows larger and larger, its convergence becomes faster and faster, and the optimal path is finally found. Therefore α and β may be adjusted adaptively according to Eq. (6); such an incentive mechanism can speed up convergence and improve search quality:

α(NC) = 5 - 4 e^(-NC / NC_max), β(NC) = 1 + 4 e^(-NC / NC_max)   (6)

2) Adaptive improvement of the evaporation coefficient ρ. When the problem scale is large, the pheromone evaporation coefficient ρ reduces the pheromone on solutions that have never been searched to nearly zero, which weakens the global search ability of the algorithm. If ρ is too large, then as the pheromone on previously found solutions grows those solutions are selected with excessive likelihood, and the global search ability declines; if ρ is too small, the convergence speed becomes too slow. Considering both global search ability and convergence speed, ρ may be turned into a threshold function: when the optimum found by the algorithm does not improve significantly within N cycles, ρ is updated according to the following function:

ρ(t+n) = γ ρ(t), if γ ρ(t) ≥ ρ_min; ρ_min, otherwise   (7)

where the initial value of ρ is 1. The minimum value ρ_min prevents too small a ρ from slowing the convergence of the algorithm. γ denotes the volatilization constraint coefficient, γ ∈ (0,1].
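The adaptive schedules can be sketched together. Note that Eqs. (6) and (8) are reconstructed here from a garbled source, so the exact functional forms below (α rising from 1 toward 5, β falling from 5 toward 1, γ decaying sigmoidally from γ_0) are a plausible reading, not a verbatim transcription.

```python
import math

def alpha_beta(nc, nc_max):
    """Eq. (6) as reconstructed: alpha grows from 1 toward 5 while beta
    shrinks from 5 toward 1 as the iteration count NC increases, so
    pheromone matters more and more in later iterations."""
    e = math.exp(-nc / nc_max)
    return 5 - 4 * e, 1 + 4 * e

def gamma_schedule(it, iter_max, gamma0=1.0, phi=2.5):
    """Eq. (8) as reconstructed: gamma starts at gamma0 and decays as
    iterations accumulate; phi controls the decay speed."""
    return 2 * gamma0 / (1 + math.exp(phi * it / iter_max))

def update_rho(rho, gamma, rho_min=1e-4):
    """Eq. (7): shrink rho by the factor gamma, floored at rho_min."""
    return max(gamma * rho, rho_min)
```

Repeatedly applying update_rho with any γ < 1 drives ρ down until it sits on the ρ_min floor, which is exactly the stagnation response the text describes.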
In order to select the ρ value reasonably, γ is expressed as a gradual process that makes its value decrease dynamically as the number of iterations grows. Its function is as follows:

γ(iter) = 2 γ_0 / (1 + e^(φ · iter / iter_max))   (8)

In Eq. (8), γ_0 is the initial maximum of γ, φ is a positive coefficient adjusting the changing speed of γ, iter is the current iteration step (search count), and iter_max is the total number of iterations.

3) Adaptive improvement of the pheromone. The existence of the pheromone evaporation coefficient drives the amount of pheromone on paths that have never been searched toward zero, thereby reducing the search ability on those paths. Conversely, when

the pheromone on a path is large, the amount of information on it keeps increasing and the chance of selecting it again grows ever larger, which also harms the global search ability of the algorithm. To address this problem, the pheromone value may be updated according to Eq. (9):

τ_ij(t+1) = (1 - ψ(m)) [(1-ρ) τ_ij(t) + Δτ_ij], if (1-ρ) τ_ij(t) + Δτ_ij ≥ τ_max;
τ_ij(t+1) = (1 + ψ(m)) [(1-ρ) τ_ij(t) + Δτ_ij], otherwise   (9)

where ψ(m) is a function proportional to the convergence count m — the more iterations, the greater the value of ψ(m) — for example ψ(m) = m/ct, where ct is a constant and m denotes the number of consecutive convergences. In this way the algorithm dynamically updates the pheromone according to the distribution of solutions and thus dynamically adjusts the pheromone intensity on each path, so that the ants are neither too concentrated nor too dispersed, avoiding premature and local convergence and improving the global search capability.

D. HMM Optimization Training with the Improved Particle Swarm-Ant Colony Algorithm (IPSAA)

Particle swarm optimization has good search ability and fast convergence, but no advantage in combinatorial optimization problems. The ant colony algorithm can make up for this shortcoming, but it suffers from blindness and slow search speed in the initial search. This paper combines the two, exploiting their complementary advantages, to optimize the HMM parameters λ = (Π, A, B).

1) Connection of PSO and ACA. In the IPSAA algorithm, the initial position of an ant corresponds to the historical optimal position of a particle in PSO. The fitness value of each particle's historical optimal solution is used to set the initial distribution of pheromone in the ant colony algorithm, and the ACA initial pheromone formula is:

τ_s = τ_min + k a^f(x_i)   (10)

In Eq. (10), τ_min denotes the minimum pheromone constant, x_i denotes the ant position corresponding to a particle's optimal position, and f(x_i) is its fitness value. k a^f(x_i) is the pheromone value converted from the PSO result, k is a constant greater than zero, and 0 < a ≤ 1.
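Both pheromone rules can be sketched as below. Eq. (9) is reconstructed from a garbled source, so the branch structure (damp above τ_max, amplify below) is a plausible reading; also note that with 0 < a ≤ 1, as transcribed, k·a^f decreases as f grows — a > 1 would be needed for greater fitness to yield more pheromone, as the text intends.

```python
def bounded_pheromone(tau, delta, rho, psi, tau_max=900.0):
    """Eq. (9) as reconstructed: apply the standard evaporation-plus-
    deposit update, then damp by (1 - psi) if the result would reach
    tau_max, otherwise amplify by (1 + psi)."""
    base = (1 - rho) * tau + delta
    return (1 - psi) * base if base >= tau_max else (1 + psi) * base

def initial_pheromone(fitness, tau_min=1e-4, k=1.0, a=0.5):
    """Eq. (10): seed ACA pheromone from a particle's best fitness.
    Caution: with a <= 1 (as in the text) this term shrinks as the
    fitness grows; choose a > 1 to reward higher fitness instead."""
    return tau_min + k * a ** fitness
```

The damp/amplify pair keeps pheromone levels inside a band, which is what prevents the colony from becoming either too concentrated or too dispersed.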
Thus, the greater f(x_i) is, the more pheromone is deposited there.

2) Coarse search of PSO for rough optimization of the HMM parameters. In the first phase, particle swarm optimization is used to optimize the HMM parameters. A particle corresponds to an HMM, and the elements of the particle's position vector X are a linear arrangement of the HMM parameters λ = (Π, A, B). The dimension of the particle search space is therefore the total number of elements of Π, A and B, i.e. N + N·N + N·M dimensions, and each particle is a real-coded string of that length. The optimal solution with the maximum fitness, X_best, is obtained by the PSO coarse search. To make the model explain the observed sequences as well as possible, the logarithmic mean of the probabilities of the observed sequences is used to measure the quality of the model. The fitness function is:

f(x_i) = f(λ_i) = (1/L) Σ_{k=1}^{L} ln p(O^k | λ_i)   (11)

where λ_i is the composite HMM corresponding to the i-th particle, L is the number of observation sequences, and O^k denotes one of the observation sequences, of length T. The probability p(O^k | λ_i) can be calculated by the forward-backward algorithm of the HMM. Taking the logarithm of the probability avoids underflow in the probability multiplication. The fitness definition (11) also applies to the subsequent ant colony phase. The constraints on the HMM probability parameters λ are that each parameter is non-negative and lies in [0,1], and that each row of probabilities sums to one. Particles violating the constraints in any generation are normalized: negative probabilities are set to 0, and each a_ij is replaced by a_ij / Σ_{j=1}^{N} a_ij so that each row sums to one. The parameters of the vector Π and the matrix B are normalized in the same way.

3) Fine search of ACA for elaborate optimization of the HMM parameters. In the second stage, the ant colony algorithm searches elaborately for a further optimized solution of the HMM parameters.
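Eq. (11) relies on computing p(O|λ); a minimal pure-Python forward algorithm suffices for a sketch. The 2-state model and integer-coded observations below are toy values for illustration, not the paper's setup.

```python
import math

def forward_prob(pi, A, B, obs):
    """p(O | lambda) via the forward algorithm (alpha recursion)."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)

def fitness(lam, sequences):
    """Eq. (11): mean log-likelihood over the L observation sequences."""
    pi, A, B = lam
    return sum(math.log(forward_prob(pi, A, B, o))
               for o in sequences) / len(sequences)
```

Since each p(O|λ) is below 1, the fitness is negative and closer to zero for models that explain the data better; the forward recursion costs O(T·N²) instead of the exponential cost of enumerating all state paths.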
Finally, the Baum-Welch algorithm is adopted for local refinement, and the final optimized HMM parameter λ is obtained. In the ant colony algorithm, a continuous search space Ω representing the set of all HMM parameters A, B, Π is first established. The dimension of the search space is the sum of the dimensions of A, B and Π, i.e. N·M + N·N + N dimensions. It can be expressed as x = [π_1, ..., π_N, a_11, ..., a_NN, b_11, ..., b_NM]^T, whose components are exactly the HMM parameters; x can also be written simply as x = (x_1, x_2, ..., x_n), 0 ≤ x_i ≤ 1, i = 1, 2, ..., n. A point in the space represents a solution; once the corresponding x is determined, the values of the HMM's A, B and Π are determined, and the corresponding P(O|λ) can be calculated by the forward-backward algorithm. Let f(x) = f(λ) = (1/L) Σ_{k=1}^{L} ln p(O^k | λ); the algorithm searches for the point that maximizes f(x), and the corresponding λ of the HMM is thereby determined. In the ant colony algorithm the search divides into two operations: searching for solutions and updating pheromone. The IPSAA algorithm improves and extends mainly these two operations. Ants search globally according to regional probability selection rules and, at the same time, search locally and randomly within a radius δ,

through which the ants move to find the optimal feasible solution. Once an ant finds a better solution, it modifies the relevant pheromone concentration to attract other ants to search further.

a) Search operation. The IPSAA algorithm assumes an ant colony Q consisting of m ants, whose task is to find the current optimal point X_best, maximizing F(X_best), in the solution space composed of HMM parameters. By Eq. (10), the pheromone content is assigned its initial value in the region initialization phase, and each region is represented by its center point position x_i. Instead of searching the entire space, the ants traverse these regions; let the set of regional center points be X_R. In each round of search, the m ants are allocated to the regions for the optimal solution search. Ants choose a region according to a probabilistic decision rule that is a function of the locally available pheromone and heuristic information. The improvement is as follows.

(1) Decision rule for regional probability:

p(x_i | X_R) = [τ(x_i)]^α [η(x_i)]^β / Σ_{x_j ∈ X_R} [τ(x_j)]^α [η(x_j)]^β   (12)

where τ(x_i) denotes the pheromone content of the regional center and η(x_i) denotes the heuristic information, which is the f(x_i) of the regional center point x_i, computed by Eq. (11). α and β are positive parameters that determine the influence of the pheromone and the heuristic information on the selection probability: the larger α is, the more the algorithm is inclined to exploit known search experience, whereas the larger β is, the stronger the algorithm's exploratory ability. Ants choose a region according to this probabilistic decision rule and are first located at the center point of the selected region. Based on the idea of the API algorithm [13], the region center point is regarded as an ant nest, and several points in the region are selected as hunting spots. To ensure that the generated hunting points also satisfy the constraints of the HMM learning problem, this paper introduces a feasible-solution generation rule to generate the hunting spots and other search points.
(2) Generation rule for feasible solutions. Let the input point be x = (x_1, x_2, ..., x_n), where x_i is the i-th vector element of the point and the vibration variable δ ∈ [0, r]. For x_i with i ∈ [1, N], the algorithm selects C_N^p feasible points and sets each x_i = x_i + δ, then selects C_N^p feasible points and sets each x_i = x_i - δ, p ∈ [0, N/2]. For x_i with i ∈ [N+1, N+N], the algorithm does the same thing; it then shifts the interval by N positions and repeats, and so on, until N·N + N. For x_i with i ∈ [N·N+N+1, N·N+N+M], the algorithm selects C_M^p feasible points and sets each x_i = x_i + δ, then selects C_M^p feasible points and sets each x_i = x_i - δ, p ∈ [0, M/2]; it then shifts the interval by M positions and repeats, and so on, until N·N + N·M + N. The algorithm then checks whether each vector value of a newly generated point x' satisfies the condition x_i - r ≤ x'_i ≤ x_i + r; points that do not meet the condition are abandoned. Based on Rule (2), a point set Ω of

N_M = Σ_{p=0}^{⌊N/2⌋} (C_N^p + C_N^p) · Σ_{p=0}^{⌊M/2⌋} (C_M^p + C_M^p)

points can be generated. The δ value is enlarged or reduced in a certain vibration order so that the selected points are uniformly distributed throughout the region. The new algorithm randomly selects p points from Ω as hunting spots.

b) Pheromone update operation. At the beginning of each round of search, the ants first select a region according to the pheromone distribution of the regions; the regional pheromone content equals the sum of the pheromone contents of the hunting points in the region. An ant departs from the nest and randomly selects a hunting point to start its search. During the search, if the operation x_s → x_b occurs — that is, the ant finds a hunting point better than the current one at the end of the search near a hunting point — the algorithm replaces the original hunting point with the current point and increases the pheromone corresponding to the current hunting spot. The update formula of the pheromone increment is shown in Eq. (13).
Δτ_j = Δτ_j + Q   (13)

In the IPSAA algorithm, ants select the region to search based on the regional pheromone content, and within a region different ants also exchange information. This internal information interaction guides the ants to search near the hunting points with better fitness values.

4) HMM training algorithm based on IPSAA. The concrete steps of the HMM training algorithm based on IPSAA are as follows:

STEP 1: Define the fitness function F(x) and initialize the PSO parameters, including the population size S, the largest cycle count Iter_max1, the learning factors c_1 and c_2 and the inertia weight ω; randomly initialize each particle's position and velocity within the allowable range; // random initialization of the HMM parameters;
STEP 2: Calculate the fitness value of each particle according to Eq. (11);
STEP 3: Compare the fitness value of each particle with the individual extremum Pbest and the global extremum Gbest respectively, and substitute whichever is better, otherwise leave them unchanged;
STEP 4: Adaptively update the velocity and position of the particles according to Eqs. (1) and (2);
STEP 5: Restrict and normalize the particle positions; // restrict and normalize the HMM parameters;
STEP 6: If the termination condition is satisfied (the error is good enough or the algorithm reaches PSO's largest cycle count

Iter_max1), terminate the PSO optimization process and obtain the best historical position of each particle; otherwise return to STEP 2;
STEP 7: Initialize the ant colony's maximum cycle count Iter_max2 and the ants' search radius δ; initialize the positions of the ant colony according to the optimal historical position of each particle, and initialize the pheromone based on Eq. (10); let Δτ_j = 0, iter2 = 1, and find the best fitness value and the corresponding position;
STEP 8: Each ant selects a region according to the regional probability decision rules in Eqs. (6) and (12) and locally searches the hunting spots within the radius δ; if a better solution is found locally, it replaces the current one, and the pheromone of the hunting point is then adaptively increased according to Eqs. (7), (8), (9) and (13);
STEP 9: Update the optimal fitness value and the corresponding position;
STEP 10: Expand the search radius of the ants, iter2++; if iter2 < Iter_max2, go to STEP 8;
STEP 11: Output the optimal solution after the ACA fine search;
STEP 12: Take the IPSAA-optimized solution above as the input parameters of Baum-Welch, and apply the B-W algorithm's local revision to obtain the final HMM parameter results.

The flowchart of the IPSAA algorithm is illustrated in Fig. 2 (PSO initialization, particle updates with constraint normalization, generation of each particle's historical optimum, ACA initialization of the pheromone distribution from the optimized solutions, regional selection and hunting-point search, pheromone update, then Baum-Welch local revision and output of the final HMM parameters).

Figure 2. Flowchart of IPSAA Parameter Training.

IV. WEB INFORMATION EXTRACTION BASED ON IPSAA-HMM

A. Extraction process based on IPSAA-HMM

This paper builds an improved IPSAA-HMM model.
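The STEP 1-12 training procedure of Section III can be outlined end-to-end on a toy objective. Everything below is an illustrative sketch: -Σx² stands in for the Eq. (11) log-likelihood, the pheromone seeding uses exp(f) in the spirit of Eq. (10), and the Baum-Welch hand-off of STEP 12 is omitted, since it would require real HMM re-estimation.

```python
import math
import random

def fitness(x):
    """Stand-in for the Eq. (11) log-likelihood score of an HMM."""
    return -sum(v * v for v in x)

def pso_phase(n=15, d=4, iters=60, w=0.4, c1=2.0, c2=2.0):
    """STEPs 1-6: PSO coarse search; returns each particle's history best."""
    xs = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    vs = [[0.0] * d for _ in range(n)]
    pbest = [x[:] for x in xs]
    for _ in range(iters):
        gbest = max(pbest, key=fitness)
        for i in range(n):
            for k in range(d):
                vs[i][k] = (w * vs[i][k]
                            + c1 * random.random() * (pbest[i][k] - xs[i][k])
                            + c2 * random.random() * (gbest[k] - xs[i][k]))
                xs[i][k] += vs[i][k]
            if fitness(xs[i]) > fitness(pbest[i]):
                pbest[i] = xs[i][:]
    return pbest

def aca_phase(regions, iters=40, delta=0.1):
    """STEPs 7-11: ants refine around the PSO history bests; pheromone
    biases region selection and is bumped when a region improves."""
    tau = [math.exp(fitness(x)) for x in regions]
    best = max(regions, key=fitness)[:]
    for _ in range(iters):
        r = random.uniform(0.0, sum(tau))      # roulette-wheel region choice
        idx, acc = 0, 0.0
        for i, t in enumerate(tau):
            acc += t
            if r <= acc:
                idx = i
                break
        cand = [v + random.uniform(-delta, delta) for v in regions[idx]]
        if fitness(cand) > fitness(regions[idx]):
            regions[idx] = cand                # better hunting point found
            tau[idx] += 1.0                    # Eq. (13)-style increment
        if fitness(regions[idx]) > fitness(best):
            best = regions[idx][:]
        delta *= 1.02                          # STEP 10: expand the radius
    return best

random.seed(1)
regions = pso_phase()
coarse_best = max(map(fitness, regions))
best = aca_phase(regions)    # STEP 12 (Baum-Welch refinement) omitted
```

The fine phase starts from the coarse phase's history bests and only ever replaces a region's point with a strictly better one, so the hand-off can never degrade the PSO result — the same monotonicity argument the paper makes for the IPSAA-to-BW hand-off.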
Using citations in Web research papers as processing objects, the model extracts state domains in references such as <Author>, <Book>, <Title> and <Journal>. The extraction process of the improved model is as follows:

(1) Information preprocessing. This article first uses the pieweb tool from the www.jaist.ac.jp/~hieuxuan/softwares/pieweb website for Web citation record extraction, and then uses delimiters such as punctuation marks and text features for information chunking pretreatment. Among the deterministic text features are the characteristic word "Journal" corresponding to the journal state domain <Journal>, the characteristic words "Conference", "Proceedings" and "Symposium" corresponding to the conference proceedings state domain <Conference>, the characteristic words "Press" and "Publishers" corresponding to the press state domain <Press>, and so on.

(2) Model training. After initializing the HMM parameters randomly, the IPSAA algorithm is adopted to optimize the HMM parameters, and then the BW algorithm is used to modify them locally, which builds an improved HMM.

(3) Information extraction. The Viterbi algorithm is employed to obtain the optimal state sequence of each test sample. The specific extraction process is shown in Fig. 3.

Figure 3. Web extraction process.

B. Experimental results and analysis

2800 unlabeled research paper citations are used as experimental samples: one part is the 800-citation data set (http://www.cs.cmu.edu/~kseymore/ie.html) from Carnegie Mellon University (CMU); the other part is 2000 literature records from 398 research papers extracted randomly from online journal databases. We select 1900 citation records as the training set, totaling 45,102 words, and the other 900 citation records as the open test set, totaling 16,104 words.
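The Viterbi decoding used in step (3) above can be sketched as a minimal log-space implementation for a discrete HMM. The matrices and the two-state example are illustrative stand-ins, not the trained eleven-state citation model:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for an observation sequence.

    obs : list of observation symbol indices
    pi  : (N,) initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = P(state j | state i)
    B   : (N, M) emission matrix, B[i, o] = P(symbol o | state i)
    """
    N, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])   # delta at t = 1
    back = np.zeros((T, N), dtype=int)         # backpointers
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[i, j]: via state i
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]                # best final state
    for t in range(T - 1, 0, -1):              # trace backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Illustrative two-state HMM: each state prefers to emit its own symbol
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print(viterbi([0, 0, 1, 1], pi, A, B))  # -> [0, 0, 1, 1]
```

In the extraction setting, the states would be the label domains (<Author>, <Title>, ...) and the symbols the observed citation tokens.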
In the HMM optimization training process, the hybrid training parameters are as follows: the population size S = 30, Itermax = 200, the initial inertia weight ω_init = 0.5, the control parameter of the self-adaptive inertia weight n = 1.25, the learning factors c1 = c2 = 2, the maximum volatility constraint coefficient restricted to (0, 1), the positive adjusting coefficient φ = 2.5, the minimum pheromone τ_min = 0.0001, the maximum pheromone τ_max = 900, Q = 1, m = 30, δ = 0.5, the BW iteration threshold ε = 1e-5, and the HMM state number N = 11; the initial value of λ is selected randomly. We use the PSO-HMM, ACA-HMM and IPSAA-BW algorithms to train the HMM and analyze the convergence of the three algorithms; their sampling error formula is defined in Eq. (14).

σ = sqrt( (1/(N·M)) Σ_{i=1..N} Σ_{j=1..M} ( p̂o_i(j) − po_i(j) )² )    (14)

where p̂o_i(j) is the observation probability of the trained model and po_i(j) is that of the sample-generation model. The comparison results are illustrated in Fig. 4.

Figure 4. Standard deviation comparison of the three algorithms (x-axis: generations, 0-200; y-axis: standard deviation, 0-0.4; curves: PSO-BW, ACA-BW, IPSAA-BW).

As shown in Fig. 4, the standard errors of PSO-HMM and ACA-HMM converge to about 0.18 and 0.11 respectively, while that of IPSAA-BW converges to about 0.04; its standard error is thus lower by about 0.14 and 0.07 than those of the other two algorithms. This shows that the improved algorithm has stronger search ability, faster convergence and very low error, and can train the HMM model more accurately, so as to improve system quality. At the same time, Fig. 5 shows that the improved algorithm has better stability. The information extraction results of the three optimization algorithms are shown in Table 1.

TABLE 1. EXTRACTION PRECISION AND RECALL COMPARISON OF THREE ALGORITHMS

State        PSO-HMM               ACA-HMM               IPSAA-HMM
             Precision  Recall     Precision  Recall     Precision  Recall
Author       0.91367    0.91413    0.91945    0.92219    0.96749    0.95342
Title        0.75854    0.80937    0.83156    0.84101    0.9074     0.92167
Book         0.80480    0.79412    0.83167    0.80467    0.84823    0.89582
Journal      0.79430    0.81583    0.81042    0.86312    0.90267    0.92551
Conference   0.76784    0.78756    0.82357    0.85382    0.85991    0.92491
Press        0.85668    0.87337    0.85945    0.88057    0.91108    0.94702
City         0.85493    0.85568    0.85462    0.85998    0.87672    0.88523
Volume       0.82558    0.85801    0.82962    0.81034    0.91945    0.93792
No.          0.89330    0.91080    0.90357    0.88382    0.91516    0.92359
Year         0.86318    0.90436    0.88164    0.89314    0.92871    0.95909
Pages        0.87008    0.85989    0.86245    0.87158    0.91455    0.94965
Average      0.83663    0.85301    0.85527    0.86220    0.90488    0.929439

TABLE 2. AVERAGE Fβ=1 COMPARISON OF THREE ALGORITHMS

        PSO-HMM   ACA-HMM   IPSAA-HMM
Fβ=1    0.83597   0.86031   0.92145
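Assuming Eq. (14) denotes the root-mean-square deviation between the observation probabilities of the trained model and of the sample-generation model, the sampling error plotted in Fig. 4 can be computed as:

```python
import numpy as np

def sampling_error(po_hat, po):
    """Sampling error in the sense of Eq. (14): RMS deviation between
    the trained model's observation probabilities po_hat and the
    sample-generation model's probabilities po, over the N states
    and M observation symbols."""
    po_hat, po = np.asarray(po_hat), np.asarray(po)
    N, M = po.shape
    return np.sqrt(np.sum((po_hat - po) ** 2) / (N * M))
```

For instance, if every entry of a 2x3 observation matrix is off by 0.1, the error is exactly 0.1.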
TABLE 3. TIME PERFORMANCE COMPARISON OF THREE ALGORITHMS

       PSO-HMM   ACA-HMM   IPSAA-HMM
t/s    13.57     18.54     14.89

As shown in Tables 1 and 2, the extraction precision and recall of IPSAA-HMM are all much higher than those of the other two algorithms. Measured by the average comprehensive index Fβ=1, IPSAA-HMM is higher than the other two by 8.5% and 6.1% respectively. The precision and recall of the states <Journal>, <Press> and <Conference> increase significantly, mainly because combining deterministic feature information with the hybrid optimization model enhances extraction performance. The interference between the state fields <Book> and <Title>, and between <Journal> and <Conference>, is strongest, but thanks to the text features and the improved HMM, the precision of the states <Title> and <Journal> and the recall of the states <Book> and <Conference> are greatly enhanced. The interference on <Author> is smaller, and its precision reaches 0.96749. Table 3 shows that the improved hybrid algorithm has better time performance, especially compared with ACA, mainly because the new algorithm avoids the blindness of the early phase of ACA and converges quickly, which also reflects the efficiency of the hybrid algorithm. Synthesizing the above data, the validity of the IPSAA-HMM algorithm can be seen.

V. CONCLUSIONS

In view of the defects of traditional HMM hybrid methods for Web information extraction, this paper proposes a self-adaptive hybrid intelligent training algorithm based on IPSAA-BW for citation extraction. The IPSAA-BW training algorithm adaptively adjusts the parameters of PSO and ACA, takes advantage of PSO's strong global search capability to generate the initial information distribution (rough search), and then uses ACA's positive feedback mechanism to obtain exact solutions (fine search), thus greatly improving the performance of HMM parameter optimization. The algorithm then uses BW to revise the parameters locally, which considers the influence of the information contained in the training sequence and of social information on HMM global optimization.
So the new hybrid algorithm enhances the probability of reaching the model's global optimum, effectively overcomes prematurity, converges quickly with extremely low error, and has stronger optimization ability. Experimental results show that, compared with the traditional PSO-HMM and ACA-HMM optimization methods, IPSAA-HMM shows a strong advantage in optimization performance and extraction accuracy,

which proves the effectiveness of the improved algorithm. Future research can focus on using new hybrid intelligent algorithms to develop HMM optimization methods with better performance and lower complexity, and then applying them to practical intelligent information processing systems.

ACKNOWLEDGEMENTS

We would like to thank the reviewers for their helpful comments. This work was financially supported by the Natural Science Foundation of China (#1072166), the Higher School Science and Technology Development Project in Shanxi Province of China (#2013147), and the key discipline construction project of Xinzhou Teachers University (#XK201403).