A Novel Optimization Technique for Translation Retrieval in Networks Search Engines

Yanyan Zhang
Zhengzhou University of Industrial Technology, Henan, China

Abstract - This paper studies models of Translation Retrieval, i.e. the relationship between the enquirer's input words and the retrieved information in network search engines. To solve the difficulties of the traditional model, a new mathematical model is proposed to quantify the correlation between web content and the user query, and experiments show that the method outperforms other Translation Retrieval methods. The improved model resolves the problems of the traditional model and improves the query precision and recall rate of search engines.

Keywords - Search engine; Translation Retrieval model; Network searching engine; Optimization.

I. INTRODUCTION

When a search engine provides an information inquiry service, it sees only the query words. People from different backgrounds may submit the same query words yet be concerned with different meanings of those words. Moreover, the search engine usually does not know the background of its users, so in order not to miss any relevant information, it places the information of interest as near the front of the result list as possible. This is a basic requirement for search engines. Therefore, the core work of a search engine is to order the crawled webpages according to certain factors based on the query words. The three main factors affecting the Translation Retrieval results are the Network searching engine of webpages, the link relationship between pages and the user's query intention.

II. TRANSLATION RETRIEVAL MODEL FRAMEWORK

Although there is a variety of Translation Retrieval models, their status and function in a search engine are the same. Figure 1 shows the framework for similarity calculation in a search engine. When the user has an information demand, query words are constructed as a concrete manifestation of that demand, and the search engine builds an internal query representation of the user's query words.
For the massive collection of web pages or documents, there is also a corresponding document representation method inside the search system. The core of the search engine is to judge which documents are relevant to the user's demand and to output them in sorted order. The correlation calculation is therefore a process of matching the user query against document content, and the Translation Retrieval model is the theoretical basis and core component used to calculate the Network searching engine.

Figure 1. Translation Retrieval Model Framework

DOI 10.5013/IJSSST.a.17.32.55 55.1 ISSN: 1473-804x online, 1473-8031 print

III. THE BM25 MODEL

BIM (Binary Independent Model) considers only whether a word appears in the document and does not consider its other features. BM25, which builds on BIM, introduces a weight for the word in the query and a weight for the word in the document, and it is now a comparatively successful content sorting model. The calculation of the BM25 model is shown in formula (1). For each query word in the query Q, its score in the document D is calculated in turn, and the accumulated sum is the correlation score of document D for query Q.

$$\sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i} \qquad (1)$$

Here $N$ is the number of documents in the collection, $n_i$ is the number of documents containing word $i$, $R$ is the number of relevant documents, $r_i$ is the number of relevant documents containing word $i$, $f_i$ is the frequency of word $i$ in the document and $qf_i$ is its frequency in the query. In formula (1), $K = k_1\left((1 - b) + b \cdot \frac{dl}{avdl}\right)$ accounts for document length. In the calculation of $K$, $dl$ is the length of the document D, $avdl$ is the average length of all documents in the document collection, and $k_1$ and $b$ are empirical parameters. The parameter $b$ is an adjustment factor: in the extreme case $b = 0$, the document length factor has no effect. Generally, setting $b$ to 0.75 gives a better search effect. Overall, the BM25 formula combines four factors: the IDF factor, the document length factor, the word frequency in the document and the query word frequency; and it uses three free adjustment parameters ($k_1$, $k_2$ and $b$) to adjust the weights of these factors.

IV. THE DIFFICULTIES AND SOLUTION

A. The Difficulties

Query words follow an uneven frequency distribution: a large number of query words are hardly ever queried by users, while a small number are queried repeatedly. As a result, many relevant query words do not appear in a given document, so the generation probability of such a query word is 0, which makes the generation probability of the whole query 0. Consequently, for a document with limited words and content, in which some individual query words do not appear, the traditional Translation Retrieval model fails.
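The failure described above can be reproduced in a few lines. The following is a minimal sketch, assuming an unsmoothed maximum-likelihood language model in which each query word contributes its relative frequency in the document; the function name `unsmoothed_query_likelihood` and the toy document are hypothetical and are not taken from the paper.

```python
from collections import Counter


def unsmoothed_query_likelihood(query_terms, doc_terms):
    """Product over query terms of (count of term in doc) / (doc length)."""
    doc_len = len(doc_terms)
    if doc_len == 0:
        return 0.0  # an empty document cannot generate any query word
    counts = Counter(doc_terms)
    prob = 1.0
    for term in query_terms:
        # A term that is absent from the document contributes a factor of 0.
        prob *= counts[term] / doc_len
    return prob


# Toy document (illustrative only): seven distinct words.
doc = "search engine ranks web pages by relevance".split()

full_match = unsmoothed_query_likelihood(["search", "engine"], doc)
one_missing = unsmoothed_query_likelihood(["search", "retrieval"], doc)

print(full_match)   # (1/7) * (1/7)
print(one_missing)  # 0.0, even though "search" does match
```

A single missing query word is enough to drive the whole product to zero, so a short document that matches most of the query is scored no better than one that matches none of it.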
This problem is called the data sparsity of the Translation Retrieval model.

The query words submitted by users may appear in different domains of a page, such as the page title, the description information and the body text. In the calculation of the Network searching engine, the weights of words in the title should be greater than those of words that appear in the text. However, when the traditional Translation Retrieval model calculates the correlation between a document and a query, it takes the document as a whole and does not take into account that different domains carry different weights. This lowers the precision of the Translation Retrieval model, and users cannot find pages that satisfy them.

B. The Solution

B.1 Multi-Parameter Data Smoothing Fusion Strategy

This paper proposes a data smoothing strategy to solve the problem of sparse data. Data smoothing takes a part of the probability mass from the words that appear in a document and assigns it to the words that do not appear in the document, so that all words have non-zero probability values and the whole probability can no longer be zero. The specific method is to introduce a background probability for all words. The background probability comes from a language model built over the whole document collection; because the collection is relatively large, most query words have a probability value in it. So, for the language model method, if the document collection contains N documents, N+1 different language models need to be established: each document has its own language model, and the data smoothing fusion strategy is built on the language model of the document collection.

$$P(Q \mid D) = \prod_{i=1}^{n} \left( (1 - \lambda) \frac{f_{q_i, D}}{|D|} + \lambda \frac{c_{q_i}}{|C|} \right) \qquad (2)$$

Formula (2) calculates the probability that document D generates the query Q after data smoothing. Here $f_{q_i,D}$ is the number of times query word $q_i$ occurs in document D, $|D|$ is the number of words in D, $c_{q_i}$ is the number of times $q_i$ occurs in the document collection and $|C|$ is the number of words in the collection. The probability of each query word is composed of two parts. The first part, $f_{q_i,D}/|D|$, is the document language model. The second part, $c_{q_i}/|C|$, is the language model of the document collection used for data smoothing, and the weights of the two parts can be adjusted by the parameter $\lambda$. The strategy is useful for handling query words that are unseen in a document, especially for content domains with only a few words or with rarely occurring keywords. The smoothing strategy introduces global information through the overall probability estimate and revises zero and very small probabilities, which helps to improve the accuracy of language-model Translation Retrieval.
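The smoothed estimate in formula (2) can be sketched as follows. This is an illustrative sketch only, assuming the linear interpolation form with a single weight `lam` between the document model and the collection model; the function name `jm_query_likelihood`, the toy documents and the list of candidate weights are hypothetical, and the text does not specify here how scores for different weights are fused.

```python
from collections import Counter


def jm_query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Product over query terms of (1 - lam) * f/|D| + lam * c/|C|."""
    doc_counts = Counter(doc_terms)
    coll_counts = Counter(collection_terms)
    doc_len = len(doc_terms)
    coll_len = len(collection_terms)
    prob = 1.0
    for term in query_terms:
        # Document language model: relative frequency inside this document.
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        # Collection (background) model: relative frequency over all documents.
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        prob *= (1.0 - lam) * p_doc + lam * p_coll
    return prob


# Toy collection (illustrative only): two documents of five words each.
docs = [
    "search engine ranks web pages".split(),
    "language model smoothing for retrieval".split(),
]
collection = [term for doc in docs for term in doc]
query = ["search", "retrieval"]  # "retrieval" is absent from docs[0]

unsmoothed = jm_query_likelihood(query, docs[0], collection, lam=0.0)
smoothed = jm_query_likelihood(query, docs[0], collection, lam=0.5)

# One score per candidate weight 0.1, 0.2, ..., 1.0 (assumed grid).
features = [jm_query_likelihood(query, docs[0], collection, lam=k / 10)
            for k in range(1, 11)]

print(unsmoothed, smoothed)
```

With `lam=0.0` the function reduces to the unsmoothed model and the missing word forces the score to zero; any positive `lam` borrows probability from the collection model, so the document keeps a non-zero score as long as the word occurs somewhere in the collection.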

The object treated in the content analysis is the content block of the webpage. The feature vector method is also applicable to the representation of a content block. In calculating the feature weight, we therefore focus more on the importance of a feature within a page than on its statistical importance in a document collection. Based on the above analysis, we use formula (3) to calculate the feature weight.

$$W_i = \frac{\sum_{j=1}^{BN} BWeight_j \times BTf_{ij}}{\sqrt{\sum_{i=1}^{n} \left( \sum_{j=1}^{BN} BWeight_j \times BTf_{ij} \right)^2}} \qquad (3)$$

where $BWeight_j$, the weight of content domain j, is decided by an important label of the content domain; $BN$ represents the total number of content domains into which the webpage is divided; $n$ represents the total number of keywords in the webpage; and $BTf_{ij}$ represents the frequency with which keyword i appears in content domain j.

V. EXPERIMENT AND ANALYSIS

A. The Optimality Verification of Language Model Smoothing Strategies

The data sets used are the 20 Newsgroup data set and the TD2003 and TD2004 subsets of the Letor3.0 data set. To test the performance of the Translation Retrieval model proposed in this paper, mean average precision (MAP) and normalized discounted cumulative gain (NDCG) are used as the evaluation measures.

TABLE 1 THE PARAMETERS SELECTED IN DIFFERENT SMOOTHING STRATEGIES

Data set       Smoothing strategy   Base Line SVD
20 Newsgroup   JM                   0.5
20 Newsgroup   DIR                  50
20 Newsgroup   DIS                  0.5
TD 2003        JM                   0.7
TD 2003        DIR                  2000
TD 2003        DIS                  0.1
TD 2004        JM                   0.7
TD 2004        DIR                  2000
TD 2004        DIS                  0.1

We treat the entire document as a single domain and take the language model parameters on the 20 newsgroup data set as a comparative test, and finally compare the performance of multi-parameter fusion sorting with that of single-optimal-parameter sorting of the language model on the test set. Table 1 shows the parameters selected for each smoothing strategy on the 20 newsgroup, TD 2003 and TD 2004 data sets. 10 parameters are selected as the optional parameters for each smoothing strategy. The parameters of the Dis and JM smoothing strategies lie in [0,1], so their parameter set can be set as {0.1, 0.2, 0.3, ..., 1}.
As for the Dir smoothing method, the selection of its parameters is centred on the parameters of the Letor data set. In this way, we obtain 10 page sorting features from each smoothing strategy of the language Translation Retrieval model. In this experiment, the MAP value is taken as the indicator for evaluating the performance of the fusion method. The experimental results of the language Translation Retrieval models based on each smoothing strategy are shown in Table 2, where SP denotes sorting with the single optimal parameter and MP denotes multi-parameter fusion sorting.

TABLE 2 MULTI PARAMETER LANGUAGE MODEL SMOOTHING METHOD FUSION

10-feature   SP       MP       Gain (%)
NEWS_Jm      0.4336   0.4340   0.09
NEWS_Dir     0.4302   0.4330   0.65
NEWS_Dis     0.4360   0.4379   0.44
TD3_Jm       0.1176   0.1300   10.54
TD3_Dir      0.0668   0.1047   56.74
TD3_Dis      0.1230   0.1305   6.10
TD4_Jm       0.1346   0.1361   1.11
TD4_Dir      0.0853   0.1283   50.41
TD4_Dis      0.1415   0.1429   0.99

Comparing the experimental results in Table 2, it can be seen that the multi-parameter language model smoothing method is superior to the single-parameter language model smoothing method in every row, especially on the TD 2003 data set. The SP method shows the general level of sorting. The results also illustrate that there is strong complementarity between the multi-parameter sorting features, which improves the sorting effect, with MAP gains ranging from 0.09% to 56.74%.

B. Performance Verification of Feature Weights Comprehensive Sorting in Different Domains of Pages

In this experiment, the classifier developed by the Beijing University network laboratory is taken as the basic classifier, and the traditional precision ratio, recall rate and F value are adopted to evaluate the classification results. When a user makes a search request, the search system returns the documents it judges relevant to the user. For such search behaviour, we can divide a document collection into four disjoint subsets according to two dimensions, as shown in Figure 3.
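The four-way partition, together with the standard precision, recall and F measures built on it, can be sketched as follows. This is a minimal sketch under stated assumptions: the two dimensions are taken to be "in the search results or not" and "related to the request or not", the subsets are labelled N, M, K and L, and the function names and toy document identifiers are hypothetical.

```python
def partition_collection(collection, retrieved, relevant):
    """Split a collection into four disjoint subsets by retrieval and relevance."""
    collection = set(collection)
    retrieved = set(retrieved) & collection
    relevant = set(relevant) & collection
    n_set = retrieved & relevant               # in the results and related
    m_set = retrieved - relevant               # in the results, not related
    k_set = relevant - retrieved               # not in the results, related
    l_set = collection - retrieved - relevant  # not in the results, not related
    return n_set, m_set, k_set, l_set


def precision_recall_f(n_count, m_count, k_count):
    """precision = N/(N+M), recall = N/(N+K), F = harmonic mean of the two."""
    precision = n_count / (n_count + m_count) if n_count + m_count else 0.0
    recall = n_count / (n_count + k_count) if n_count + k_count else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy example (illustrative only): six documents, three retrieved, four relevant.
collection = {"d1", "d2", "d3", "d4", "d5", "d6"}
retrieved = {"d1", "d2", "d3"}
relevant = {"d1", "d2", "d4", "d5"}

n_set, m_set, k_set, l_set = partition_collection(collection, retrieved, relevant)
precision, recall, f_value = precision_recall_f(len(n_set), len(m_set), len(k_set))

print(sorted(n_set), sorted(m_set), sorted(k_set), sorted(l_set))
print(precision, recall, f_value)
```

The subset L does not enter any of the three measures, which is why they stay meaningful even when the collection is far larger than the result list.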

On the basis of dividing the document set into 4 subsets, we can quantitatively describe the precision rate, recall rate and F value. The following three formulas are the calculation methods of these three indexes.

$$precision = \frac{N}{N + M}$$

$$recall = \frac{N}{N + K}$$

$$F = \frac{2 \times precision \times recall}{precision + recall}$$

Figure 3 Understanding the two dimensions of document collection

In Figure 3,
1) N represents the documents that are in the results of this search and related to the search request.
2) M represents the documents that are in the results of this search but not related to the search request.
3) K represents the documents that are out of the results of this search but related to the search request.
4) L represents the documents that are out of the results of this search and not related to the search request.

We can use the above three formulas to calculate the three indexes for the different categories of documents in the data set. Figure 4 is a performance comparison between the old classifier and the new one, in which the horizontal axis represents the category numbers, and Table 3 shows the meaning of each category number in Figure 4.

Figure 4 Comparison of classification results before and after web page cleaning

TABLE 3 THE CHECK LIST OF CATEGORY NUMBERS

Category Numbers   Class Names
1                  Humanity
2                  News Media
3                  Business Economy
4                  Entertainment and Leisure
5                  IT
6                  Education
7                  Tourism
8                  Natural Science
9                  Government Politics
10                 Social Science
11                 Health Care
12                 Social Culture

Figure 4 shows that the classification results of all categories improve after web page cleaning. In addition, because the webpages in the training set and testing set were selected manually, they tend to be pages with more text information and less noise information. Therefore, the purification effect of web page cleaning in practical applications should be more obvious than in this experiment.

VI. CONCLUSION

This paper optimizes the traditional Translation Retrieval model based on the Network searching engine and finds an effective solution to the problems of data sparsity and of equal weighting of different domains in the traditional model. The improved model can effectively raise the precision and recall rate of a search engine, which provides a method and a theoretical basis for the development of search engines.