A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION

1 FENG YONG, 2 DANG XIAO-WAN, 3 XU HONG-YAN
School of Information, Liaoning University, Shenyang, Liaoning
E-mail: 1 fyxuhy@163.com, 2 dangxiaowan@163.com, 3 xuhongyan_lndx@163.com

ABSTRACT

A great number of duplicated entities exist in the heterogeneous data sources of the Web. Identifying these entities is the most important prerequisite for pattern matching and data integration. To overcome the deficiencies of current entity recognition methods, such as a low level of automation and poor adaptability, a Deep Web entity recognition method based on a BP neural network is proposed in this paper. To make full use of the autonomous learning capability of the BP neural network, the similarities of semantic blocks are used as its input. The method obtains a correct entity recognition model through training, and accomplishes automated entity recognition across heterogeneous data sources. Finally, the feasibility of the proposed method is verified via experiments. The proposed method can reduce manual intervention and improve the efficiency and the accuracy of entity recognition simultaneously.

Keywords: Deep Web; BP Neural Network; Entity Identification; Similarity

1. INTRODUCTION

With the development of Web technology and the popularization of its applications, Internet information resources have increased sharply. Web data sources can be divided into the Surface Web and the Deep Web according to the amount of information they contain. Usually, a Web data source that cannot be searched by traditional search engines is defined as Deep Web [1]. In the study of the Deep Web, entity recognition is the primary premise for pattern matching and data integration. How to recognize the same entities accurately and effectively in the Deep Web has therefore become a research focus in the field.
In the field of Deep Web entity recognition, scholars have carried out in-depth research and obtained some representative results. Tejada, Knoblock and Minton put forward a series of methods based on pattern matching and public attribute weights, which derive string transformation rules and apply them to public attributes to recognize entities. The method increases the accuracy of entity recognition, but it requires pattern matching as a prerequisite and needs a lot of manual intervention [2]. Cui, Xiao and Ding put forward an adaptive Web record matching method based on distance; this method increases the efficiency of entity recognition, but it is difficult to apply to badly structured Web data [3]. Sarawagi put forward a method that finds equivalent units in a given table model and then compares the similarity of the corresponding attributes of the units to recognize entities. This kind of method shortens recognition time and improves the recall rate, but its extensibility is poor [4]. Aiming to solve the problems of the existing research, such as the low level of automation and the poor adaptability to heterogeneous Web data sources, BP neural network technology is applied to entity recognition in the Deep Web.

The structure of this paper is as follows. Section 2 introduces the theory of the BP neural network. Section 3 presents the Deep Web entity recognition method based on the BP neural network in detail. Section 4 gives a specific experiment to verify the practicability of the proposed method. Section 5 summarizes the whole paper and points out the advantages of the proposed method.

2. BP NEURAL NETWORK OVERVIEW

An Artificial Neural Network (ANN) is a mathematical model, built on the basis of autonomous learning, that imitates the structure and function of biological neural networks. It can complete extremely complex pattern extraction or trend analysis through the analysis of large amounts of complex data.
The BP neural network learning process is divided into two stages, namely the information forward propagation stage and the error back propagation stage. In the forward propagation stage, the input samples enter at the input layer and are transported to the output layer after processing by the hidden layer. If there is a large error between the actual output of the output layer and the expected output, the network enters the error back propagation stage. In this stage, the main task is correcting the thresholds of all neurons and the weights of the connections. Forward propagation and error back propagation are repeated until the error value meets the requirements, and the learning process is over. In this way a correct neural network model can be obtained.

3. DEEP WEB ENTITIES RECOGNITION METHOD BASED ON BP NEURAL NETWORK

In heterogeneous Web data sources, the same entity appears in different forms. Deep Web entity recognition detects the same entities in different data sources, in order to lay a solid foundation for eliminating data redundancy and comparing the same entity. Combining the BP neural network with Deep Web entity recognition makes full use of the autonomous learning of the ANN to reduce manual intervention and improve the accuracy of recognition. The Deep Web entity recognition method based on the BP neural network is divided into three main steps: first, divide the entities into blocks; second, calculate semantic block similarity; third, train the entity recognition model. The detailed introduction is as follows.

3.1 Divide The Entities Into Blocks

The entities are divided into blocks according to an attribute, or a combination of attributes with related meaning, in a semantic block. Through the analysis of the entities, it can be found that an entity not only contains content information but also includes metadata information that can influence entity recognition. It is necessary to preprocess the content of the entity information before training the entity recognition model.
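The two-stage learning loop described above can be sketched in a few lines of Python. This is a minimal, self-contained illustration of forward propagation followed by error back propagation on a tiny two-layer network; the layer sizes, learning rate, and activation choice here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bp(X, Y, hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a two-layer BP network: repeat forward propagation and
    error back propagation for a fixed epoch budget."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, Y.shape[1])); b2 = np.zeros(Y.shape[1])
    for _ in range(epochs):
        # forward propagation: input layer -> hidden layer -> output layer
        H = sigmoid(X @ W1 + b1)
        O = sigmoid(H @ W2 + b2)
        # error back propagation: correct connection weights and thresholds
        dO = (O - Y) * O * (1.0 - O)
        dH = (dO @ W2.T) * H * (1.0 - H)
        W2 -= lr * H.T @ dO; b2 -= lr * dO.sum(axis=0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)
    return W1, b1, W2, b2

def predict(X, params):
    W1, b1, W2, b2 = params
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
```

In practice, as the paper describes, training would stop once the output error falls below a target precision rather than after a fixed number of epochs.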
First, the content information is divided into blocks. Then each semantic block is processed as a text. Thirdly, the semantic blocks of the entity are compared with the blocks of other entities. Finally, the similarity values of the semantic blocks are calculated and used as the input of the BP neural network.

This paper uses a partition method based on Web page layout tags for the semantic structure. In the Deep Web sites of a given field, the semantic block contents of the entities are roughly similar; only very few semantic blocks are unique to a single site. Therefore, all learning sample entities are divided into semantic blocks, and each divided block belongs to a known attribute, so the attribute text labels corresponding to these semantic blocks can be collected. The union of these labels is then calculated, forming a set that includes all attribute text labels corresponding to the semantic blocks. The set is recorded as A = {A1, A2, ..., An}, where n is the number of text labels corresponding to the semantic blocks. After semantic block division of any entity in the data source, the entity can be represented by a semantic block set S = {S1, S2, ..., Sn}, where Si is the attribute text label of semantic block i. If the attribute text label cannot be found, Si is empty.

3.2 Calculate Semantic Block Similarity

After dividing an entity into blocks, since each block makes a different contribution to entity recognition, the similarity between the semantic blocks and the other entity must be calculated; it is used as the input of the neural network training. The specific steps are: divide entity A into n blocks to form the corresponding semantic block set S = {S1, S2, ..., Sn}; calculate the similarity between each block and entity B to get a group of similarity values; and express them as a vector, recorded as T = (Sim(S1, B), Sim(S2, B), ..., Sim(Sn, B)), where Sim(Sx, B) is the similarity of any block of entity A with entity B. Because of the poor structure of HTML documents, the concept of pattern matching would increase the complexity of the calculation.
So in this paper entity A is divided first, and then the similarities between each block and entity B are calculated. During the calculation, different methods are used according to the different feature attribute types.

(1) Non-string type. For properties of non-string type, such as numeric and currency types, the similarity can be calculated according to the range distance between them. A block of entity A is recorded as non-string t1 and one from entity B is recorded as non-string t2 (more than one non-string may be extracted from entity B). The similarity of each non-string of entity B with t1 can be calculated using formula (1), and the biggest similarity is taken as the corresponding neural network input, based on the visual characteristics of t2.

sim(t1, t2) = 1 - (|t1 - t'| + |t2 - t'|) / t'    (1)

where t' is the average value of t1 and t2.

(2) String type. For properties of string type, the similarity between a semantic block and the string part of entity B can be calculated using the string edit distance method. The method counts the minimum number of insert, delete and replace operations needed to convert the source string mi into the target string mj. Formula (2) converts the edit distance into a similarity.

sim_edit(mi, mj) = max(0, (min(|mi|, |mj|) - D(mi, mj)) / min(|mi|, |mj|))    (2)

D(mi, mj) = min{ D(mi-1, mj-1) + C(mi -> mj),
                 D(mi-1, mj) + C(mi -> ε),
                 D(mi, mj-1) + C(ε -> mj) }

where mi is a string semantic block of entity A, mj is the character string of the text part of entity B, D(mi, mj) is the edit distance, C(mi -> mj) is the cost of replacing a character of mi with one of mj, C(mi -> ε) is the cost of deleting one character of mi, and C(ε -> mj) is the cost of inserting one character of mj.

3.3 Train Entities Recognition Model

This step uses the BP neural network to identify Deep Web entities. Firstly, the entity recognition model is established with the BP neural network. The advantage of the BP neural network is that the weight of each attribute need not be calculated by hand: whether two Deep Web entities match is determined by the inner relationships among the attributes, learned through the self-study of the BP neural network. Then the sample entities are divided into two categories, namely the set of matching entities and the set of unmatched entities. The two types of entities are input to the BP neural network respectively to train the weights and thresholds, and the entity recognition model is generated. Finally, the entity data that need to be matched are input into the trained entity recognition model, and the result is generated by the model. The specific training steps are as follows.

(1) The entity recognition model is established based on the BP neural network.
There are n nodes in the input layer, corresponding to the similarities between the n semantic blocks and the other entity; if there is no corresponding similarity, 0 is entered. The output layer has two nodes. When the output vector is (1, 0), the entities match; when the output is (0, 1), the entities do not match. The number of hidden layer nodes is determined by formula (3).

n = sqrt(i + j) + d    (3)

where n is the number of hidden layer nodes, i is the number of input layer nodes, j is the number of output layer nodes, and d is a constant in [1, 10].

(2) The sample entities are divided into the matching entity set M(r) and the unmatched entity set U(r). The similarities of the blocks of entity A and entity B are calculated respectively and denoted Sim(A, B).

(3) The similarities Sim(A, B) of the M(r) entity set are used as the input of the entity recognition model. The attribute text labels of the semantic blocks correspond to the input nodes. The similarity values Sim(A, B) are input to the nodes corresponding to their semantic blocks; if there is no corresponding node, the input value is 0. The desired output is (1, 0). If the error between the actual output and the desired output is too high, the network conducts an error back propagation phase to adjust the weights and thresholds until the neural network converges and the error precision meets the requirement.

(4) The similarities Sim(A, B) of the U(r) entity set are used as input, and the desired output is (0, 1). The process is the same as step (3), until the neural network converges and the error precision meets the requirement.

(5) The similarities Sim(A, B) to be tested are input to the completed entity recognition model and the output S is obtained.

(6) If the error range of S in the target model is within the scope of M(r), the tested entities match. On the other hand, if the error range of S in the target model is within the scope of U(r), the tested entities do not match. The target model range is defined by the upper and lower limits of the training set.
In the experiments of this paper, for example, the lower limit of the target model M(r) is (0.85, 0.148), and an output S in the error range ([0.85, 1], [0, 0.148]) means the entities match. The upper limit of the target model U(r) is (0.308, 0.69), and an output S in the error range ([0, 0.308], [0.69, 1]) means the entities do not match. If the output falls in ([0.308, 0.85], [0.148, 0.69]), it belongs to neither the matched set nor the unmatched set, and an expert is needed for manual recognition.
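The two similarity measures of formulas (1) and (2) and the three-way decision just described can be sketched in Python. This is a minimal illustration assuming unit costs for insert, delete and replace; the default thresholds of classify are the example target-mode limits quoted above:

```python
def sim_numeric(t1, t2):
    """Formula (1): range-distance similarity for non-string values.
    t_avg is the average of t1 and t2."""
    t_avg = (t1 + t2) / 2.0
    if t_avg == 0:
        return 1.0 if t1 == t2 else 0.0
    return 1.0 - (abs(t1 - t_avg) + abs(t2 - t_avg)) / t_avg

def edit_distance(a, b):
    """Minimum number of insert/delete/replace operations (unit costs)."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + cost,  # replace (or keep)
                          D[i - 1][j] + 1,         # delete
                          D[i][j - 1] + 1)         # insert
    return D[m][n]

def sim_string(a, b):
    """Formula (2): edit distance converted into a similarity in [0, 1]."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return max(0.0, (shorter - edit_distance(a, b)) / shorter)

def classify(output, m_lower=(0.85, 0.148), u_upper=(0.308, 0.69)):
    """Step (6): compare a model output (s1, s2) against the target-mode
    ranges; the default limits are the example values from the paper."""
    if output[0] >= m_lower[0] and output[1] <= m_lower[1]:
        return "matched"
    if output[0] <= u_upper[0] and output[1] >= u_upper[1]:
        return "unmatched"
    return "manual review"
```

With the example limits, an output of (0.96, 0.038) is classified as matched and (0.06, 0.994) as unmatched, matching the test results reported in Section 4.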
4. EXPERIMENTS

The experimental data is derived from two large websites, namely Amazon and Dangdang. Similar entities are obtained from the two websites by submitting queries to the query interfaces of the sites. These entities are divided into a training set and a testing set. Firstly, the entities are divided into semantic blocks. This experiment obtains 12 attribute text labels corresponding to semantic blocks as a collection, denoted K = {ISBN, title, author, market price, price, discounts, press, paperback, barcode, ASIN, weight, size}. The similarities between each semantic block of any entity and another entity are calculated in the training set and used as the input for training the entity recognition model. The similarities of the training samples are shown in Table 1.

Table 1: The Similarity Of The Training Samples

No | ISBN | title | author | market price | price | discounts | press | paperback | barcode | ASIN | weight | size
Similarities of M(r):
1  | 0.93 | 0.92 | 0.90 | 0.89 | 0.91 | 0.88 | 0.79 | 0.85 | 0.75 | 0.73 | 0.70 | 0.00
2  | 0.96 | 0.90 | 0.83 | 0.89 | 0.80 | 0.90 | 0.73 | 0.79 | 0.80 | 0.81 | 0.70 | 0.64
3  | 0.92 | 0.89 | 0.87 | 0.94 | 0.80 | 0.77 | 0.78 | 0.82 | 0.70 | 0.72 | 0.00 | 0.77
4  | 0.90 | 0.93 | 0.92 | 0.90 | 0.88 | 0.85 | 0.80 | 0.71 | 0.73 | 0.77 | 0.69 | 0.00
5  | 0.97 | 0.91 | 0.94 | 0.90 | 0.92 | 0.88 | 0.81 | 0.75 | 0.80 | 0.69 | 0.71 | 0.40
Similarities of U(r):
6  | 0.12 | 0.20 | 0.19 | 0.34 | 0.22 | 0.15 | 0.08 | 0.17 | 0.11 | 0.06 | 0.08 | 0.01
7  | 0.09 | 0.23 | 0.12 | 0.20 | 0.15 | 0.19 | 0.08 | 0.09 | 0.07 | 0.07 | 0.00 | 0.09
8  | 0.17 | 0.23 | 0.18 | 0.22 | 0.17 | 0.15 | 0.09 | 0.11 | 0.10 | 0.07 | 0.11 | 0.10
9  | 0.12 | 0.18 | 0.18 | 0.13 | 0.17 | 0.08 | 0.09 | 0.12 | 0.11 | 0.09 | 0.00 | 0.00
10 | 0.29 | 0.18 | 0.12 | 0.10 | 0.16 | 0.09 | 0.07 | 0.13 | 0.12 | 0.04 | 0.00 | 0.08

Finally, the entity recognition model is established on the basis of the BP neural network. The maximum number of iterations of the neural network is not limited, so training continues until the error meets the requirements; the objective error is 0.001 and the learning rate is 0.2. There are 12 input nodes and 2 output nodes in the neural network. The number of nodes in the hidden layer is 10 according to formula (3).
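Formula (3) is consistent with the reported configuration: with i = 12 input nodes and j = 2 output nodes, sqrt(12 + 2) ≈ 3.74, so a constant of d = 6 (an assumed value; the paper does not state its choice of d) yields the 10 hidden nodes used here:

```python
import math

def hidden_nodes(i, j, d):
    """Formula (3): number of hidden-layer nodes, n = sqrt(i + j) + d."""
    return round(math.sqrt(i + j) + d)

# 12 input nodes, 2 output nodes, assumed d = 6:
# round(sqrt(14) + 6) = round(9.74...) = 10
```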
After training the entity recognition model, the lower limit of the M(r) target mode is (0.85, 0.148) and the upper limit of the U(r) target mode is (0.308, 0.69). A pair of matched entities A and B and a pair of unmatched entities A and C taken from the test set are used for testing. Entity A is divided into blocks and the similarities with entity B and entity C are computed respectively. The results of the similarity computation are shown in Table 2.

Table 2: Similarities Of The Test Entities

                                              | ISBN | title | author | market price | price | discounts | press | paperback | barcode | ASIN | weight | size
Similarities of semantic blocks of A and entity B | 0.94 | 0.93 | 0.95 | 0.90 | 0.88 | 0.85 | 0.89 | 0.81 | 0.77 | 0.14 | 0.79 | 0.00
Similarities of semantic blocks of A and entity C | 0.19 | 0.23 | 0.25 | 0.11 | 0.30 | 0.12 | 0.09 | 0.10 | 0.07 | 0.15 | 0.06 | 0.05

The test result of entity A and entity B is (0.96, 0.038), which is located in the target mode range ([0.85, 1], [0, 0.148]), so entity A and entity B match. The test result of entity A and entity C is (0.06, 0.994), which is located in the target mode range ([0, 0.308], [0.69, 1]), so entity A and entity C do not match.

With an increasing number of samples, the accuracy of the model increases correspondingly, but too many training samples increase the time complexity. This paper uses the F value as the evaluation index; the F value is calculated by formula (4).

F = (2 × accuracy rate × recall rate) / (accuracy rate + recall rate)    (4)

The performance of this paper compared with the methods given by literature [5, 6] is shown in Figure 1. The F value of the method given by literature [5] is only 77% and the F value of the method given by literature [6] is 83%, while the F value of the method given by this paper is 93%. The experimental results indicate that the proposed method has better performance.
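Formula (4) is the usual harmonic mean of the two rates; a one-function Python sketch makes the computation explicit (for instance, equal accuracy and recall give the same value back):

```python
def f_value(accuracy_rate, recall_rate):
    """Formula (4): F = 2 * accuracy * recall / (accuracy + recall)."""
    if accuracy_rate + recall_rate == 0:
        return 0.0
    return (2.0 * accuracy_rate * recall_rate) / (accuracy_rate + recall_rate)
```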
[Figure 1: F Value Comparison. Bar chart of F values: the method of literature [5], the method of literature [6], and the method of this paper.]

5. CONCLUSION

The development of the great wealth of Internet resources places higher requirements on the effectiveness and performance of entity recognition. This paper makes full use of the advantages of the BP neural network and proposes a Deep Web entity matching method on its basis. The method, which does not require pattern matching as a premise, reduces the complexity of entity recognition by dividing entities into blocks, while the similarities of each semantic block with another entity are used as the input of the neural network training. The performance of entity recognition is improved by the independent training of the BP neural network. The contributions of the proposed method are to improve the accuracy and the efficiency of Deep Web entity recognition, while at the same time reducing manual intervention and solving the problem of a low level of automation. These contributions are verified via the experiments.

ACKNOWLEDGMENTS

This work was supported in part by the Social Science Youth Foundation of the Ministry of Education of China under Grant No. 12YJCZH048, the Natural Science Foundation of Liaoning Province of China under Grant No. 2010083, and the Hundred, Thousand and Ten Thousand Talent Project of Liaoning Province of China.

REFERENCES:

[1] WANG YAN, SONG BAO YAN, ZHANG JIA YANG, et al. Deep Web query interface identification approach based on label coding [J]. Journal of Computer Applications, 2011, 31(5): 1351-1354.
[2] TEJADA SHEILA, KNOBLOCK CRAIG A, MINTON STEVEN. Learning domain-independent string transformation weights for high accuracy object identification [C]//Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York: ACM Press, 2002: 350-359.
[3] CUI JIAN JUN, XIAO HONG YU, DING LI XIN. Distance-Based Adaptive Record Matching for Web Databases [J]. Journal of Wuhan University, 2012, 58(1): 89-94.
[4] SARAWAGI SUNITA, BHAMIDIPATY ANURADHA.
Interactive deduplication using active learning [C]//Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. New York: ACM Press, 2002: 269-278.
[5] LI WEN-SYAN, CLIFTON CHRIS. SEMINT: A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks [J]. Data and Knowledge Engineering, 2000, 33: 49-84.
[6] QIANG BAO HUA, CHEN LING, YU JIAN QIAO, et al. Research on Attribute Matching Approach Based on Neural Network [J]. Journal of Computer Science, 2006, 33(1): 49-51.