A Novel Optimization Technique for Translation Retrieval in Networks Search Engines


Yanyan Zhang
Zhengzhou University of Industrial Technology, Henan, China

Abstract - This paper studies models of Translation Retrieval, i.e. the relationship between the enquirer's input words and the information retrieved in network search engines. In order to solve the difficulties in the traditional model, a new mathematical model is proposed to quantify the correlation between web content and user query, and the method is shown by experiments to outperform other Translation Retrieval methods. The improved model is a good solution to the problems of the traditional model, greatly improving the query precision and recall rate of search engines.

Keywords - Search engine; Translation Retrieval model; Network searching engine; Optimization.

I. INTRODUCTION

When a search engine provides an information inquiry service, it only sees the query words. People from different backgrounds may submit the same query words, but are often concerned with different meanings of those query words. Moreover, the search engine usually does not know the background of the users, so in order not to miss any relevant information, it places the focused information as near as possible to the front of the search list. This is a basic requirement for search engines. Therefore, the core work of a search engine is to sequence the crawled webpages according to some factors based on the query words. The three main factors affecting the Translation Retrieval results are the Network searching engine of webpages, the link relationship of pages and the user's query intention.

II. TRANSLATION RETRIEVAL MODEL FRAMEWORK

Although there is a variety of Translation Retrieval models, their status and function in a search engine are the same. Figure 1 shows a framework for calculating similarity in a search engine. When the user has an information demand, the query words are constructed as a concrete manifestation of that demand, and the search engine constructs an internal query representation of the user's query words.
For the massive collection of web pages or documents, there is also a corresponding document representation method inside the search system. The core of the search engine is to judge which documents are relevant to the user's demand and to output them in sorted order. So the correlation calculation is a process of matching the user query against document content, and the Translation Retrieval model is the theoretical basis and core component used to calculate the Network searching engine.

Figure 1. Translation Retrieval Model Framework

DOI 10.5013/IJSSST.a.17.32.55 55.1 ISSN: 1473-804x online, 1473-8031 print

III. THE BM25 MODEL

BIM (Binary Independent Model) only considers whether a word appears in the document or not and does not consider the word's own features. BM25, based on BIM, introduces the weight of the word in the query and the weight of the word in the document. As a result, BM25 is now a comparatively successful content sorting model. The specific calculation of the BM25 model is shown in formula (1). For each query word appearing in the query Q, its score in the document D is calculated in turn, and the accumulated sum is the correlation score of document D with respect to query Q.

$$\sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1)\, qf_i}{k_2 + qf_i} \qquad (1)$$

In formula (1), the term

$$K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right)$$

accounts for document length. In the formula for K, dl is the length of the document D, avdl is the average length of all documents in the document collection, and $k_1$ and b are empirical parameters. The parameter b is an adjustment factor: in the extreme case where b is set to 0, the document length factor has no effect. Generally, setting b to 0.75 gives a better search effect. Overall, the BM25 formula combines four factors: the IDF factor, the document length factor, the word frequency in the document and the word frequency in the query. It uses three free adjustment factors ($k_1$, $k_2$ and b) to adjust the weights of these factors.

IV. THE DIFFICULTIES AND SOLUTION

A. The Difficulties

Query words follow an uneven frequency distribution. Quite a number of query words are hardly ever queried by users, while a small number of query words are queried repeatedly. This leads to a problem: many relevant query words do not appear in the document, so the generation probability of such a query word is 0, which means that the generation probability of the whole query is 0. So if a document has limited words and content, and especially if some individual query words do not appear in it, the traditional Translation Retrieval model fails.
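Returning to formula (1) of Section III, the scoring can be sketched in code as follows. This is a minimal illustration, not the author's implementation. It assumes the usual reading of the symbols, which the text does not spell out: N documents in the collection, n_i of them containing word i, and f_i and qf_i the frequencies of word i in the document and in the query. It also sets the relevance counts R and r_i to 0 because no relevance information is assumed to be available, and it uses k1 = 1.2 and k2 = 100 as placeholder values that the paper does not specify. All function and variable names are hypothetical.

```python
import math
from collections import Counter


def build_doc_freq(docs):
    """Count, for every word, how many documents contain it (n_i)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return doc_freq


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avdl,
               k1=1.2, k2=100.0, b=0.75):
    """Score one tokenized document against one tokenized query.

    Assumption: R = r_i = 0, so the first factor of formula (1)
    reduces to log((N - n_i + 0.5) / (n_i + 0.5)).
    k1 and k2 defaults are placeholders, not values from the paper.
    """
    tf = Counter(doc_terms)       # f_i: word frequency in the document
    qtf = Counter(query_terms)    # qf_i: word frequency in the query
    dl = len(doc_terms)
    # Document length factor K = k1 * ((1 - b) + b * dl / avdl)
    K = k1 * ((1 - b) + b * dl / avdl)

    score = 0.0
    for term, qf in qtf.items():
        f = tf.get(term, 0)
        if f == 0:
            continue  # a word absent from the document contributes nothing
        n = doc_freq.get(term, 0)
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        score += idf * ((k1 + 1) * f / (K + f)) * ((k2 + 1) * qf / (k2 + qf))
    return score
```

With b = 0 the value of K no longer depends on dl or avdl, which matches the remark in Section III that the document length factor then has no effect.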
This problem is called the data sparsity of the Translation Retrieval model.

The query words submitted by users may appear in domains such as the page title, description information, text, etc. In the calculation of Network searching engine, the weights of words in the title should be greater than those of words that appear in the text. However, when the traditional Translation Retrieval model calculates the correlation between a document and a query, it takes the document as a whole and does not take into account that different domains carry different weights. As a result, the precision of the Translation Retrieval model drops and users cannot find pages that satisfy them.

B. The Solution

B.1 Multi-Parameter Data Smoothing Fusion Strategy

This paper proposes a data smoothing strategy to solve the problem of sparse data. Data smoothing takes a part of the probability mass of the words appearing in a document and assigns it to the words that do not appear in the document, so that all words have non-zero probability values and a zero overall probability is avoided in the calculation. The specific method is to introduce a background probability for all words. The background probability comes from a language model built over the whole document collection; because the collection is relatively large, most query words have a probability value in it. So, for the language model method, if the document collection contains N documents, N+1 different language models need to be established: each document has its own language model, and the data smoothing fusion strategy is built on the language model of the document collection.

$$P(Q \mid D) = \prod_{i=1}^{n} \left( (1 - \lambda) \frac{f_{q_i, D}}{|D|} + \lambda \frac{c_{q_i}}{|C|} \right) \qquad (2)$$

Formula (2) calculates the probability of document generation after data smoothing. The probability of each query word is composed of two parts.
The first part, $(1 - \lambda)\, f_{q_i, D} / |D|$, is the document language model. The second part, $\lambda\, c_{q_i} / |C|$, is the language model of the document collection used for data smoothing, and the weights of the two parts can be adjusted by the parameter $\lambda$. The strategy is useful for handling query words that are invisible in a document, especially for a content domain with only a few words or with rarely occurring keywords. Through the overall probability estimation, the smoothing strategy introduces global information and revises zero and minimal probabilities, which helps to improve the accuracy of language model Translation Retrieval.
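The effect of the smoothing in formula (2) can be illustrated with a short sketch. It assumes tokenized documents, treats the collection as one long list of words, and uses the name lam for the mixing weight; these names and the example data are hypothetical and not taken from the paper.

```python
from collections import Counter


def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """Smoothed probability of generating the query from a document.

    Each query word contributes
        (1 - lam) * f(q, D) / |D|  +  lam * c(q) / |C|
    where f(q, D) is the word count in the document and c(q) the
    word count in the whole collection. lam = 0 disables smoothing.
    """
    doc_tf = Counter(doc_terms)
    col_tf = Counter(collection_terms)
    d_len = len(doc_terms)
    c_len = len(collection_terms)

    prob = 1.0
    for q in query_terms:
        p_doc = doc_tf[q] / d_len if d_len else 0.0
        p_col = col_tf[q] / c_len if c_len else 0.0
        prob *= (1 - lam) * p_doc + lam * p_col
    return prob


# Hypothetical toy data: "query" never occurs in the document itself.
document = ["search", "engine"]
collection = document + ["query", "words"]

unsmoothed = query_likelihood(["search", "query"], document, collection, lam=0.0)
smoothed = query_likelihood(["search", "query"], document, collection, lam=0.5)
```

Without smoothing, the single missing word drives the whole product to zero, which is the data sparsity problem described in Section IV.A; with a non-zero weight on the collection model the document keeps a small positive score.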

The object treated in the content analysis is the content block of the webpage. The feature vector method is also applicable to the representation of content blocks. Therefore, in calculating the feature weight, we focus more on a feature's importance within a page than on its statistical importance in the document collection. Based on the above analysis, we use formula (3) to calculate the feature weight.

$$W_i = \frac{\sum_{j=1}^{BN} BWeight_j \times BTf_{ij}}{\sqrt{\sum_{i=1}^{n} \left( \sum_{j=1}^{BN} BWeight_j \times BTf_{ij} \right)^2}} \qquad (3)$$

where $BWeight_j$, the weight of content domain j, is decided by an importance label of the content domain; BN is the total number of content domains into which the webpage is divided; n is the total number of keywords in the webpage; and $BTf_{ij}$ is the frequency with which keyword i appears in content domain j.

V. EXPERIMENT AND ANALYSIS

A. The Optimality Verification of Language Model Smoothing Strategies

First is the selection of data sets: the 20 Newsgroup data set and the subsets TD2003 and TD2004 of the Letor3.0 data set are used. To test the performance of the Translation Retrieval model proposed in this paper, mean average precision (MAP) and normalized discounted cumulative gain (NDCG) are used as the evaluation measures.

TABLE 1. THE PARAMETERS SELECTED IN DIFFERENT SMOOTHING STRATEGIES

Data set        Base Line   SVD
20 Newsgroup    JM          0.5
                DIR         50
                DIS         0.5
TD 2003         JM          0.7
                DIR         2000
                DIS         0.1
TD 2004         JM          0.7
                DIR         2000
                DIS         0.1

We select the entire document as a single domain, then take the language model parameters on the 20 Newsgroup data set as a comparative test, and finally compare the performance of multi-parameter fusion sorting with the single optimal parameter sorting of the language model on the test set. Table 1 shows the parameters selected for each smoothing strategy on the data sets 20 Newsgroup, TD 2003 and TD 2004. For each smoothing strategy, 10 parameters are selected as the optional parameters. The parameters of the Dis and JM smoothing strategies lie in [0,1], so their parameter set can be set as {0.1, 0.2, 0.3, ..., 1}.
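The two evaluation measures named at the start of this subsection, MAP and NDCG, can be sketched as follows. The paper does not give their formulas, so this is one common formulation stated as an assumption: average precision is normalized by the number of relevant documents found in the ranked list, and NDCG uses a gain of 2^g - 1 with a log2 rank discount. All function names are hypothetical.

```python
import math


def average_precision(relevance):
    """Average precision of one ranked list of 0/1 relevance labels.

    Assumption: every relevant document appears somewhere in the list,
    so the sum is divided by the number of relevant documents found.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: mean of the average precision over all queries."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


def dcg(gains, k):
    """Discounted cumulative gain of the top k graded labels."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))


def ndcg(gains, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

Each ranking passed to these functions is the list of relevance labels in the order the retrieval model returned the documents, one list per query.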
As for the Dir smoothing method, the selection of its parameters is centred on the parameters of the Letor data set. In this way, we get 10 page sorting features from each smoothing strategy of the language Translation Retrieval model. In this experiment, the MAP value is taken as the indicator for evaluating the performance of the fusion method. The experimental results of the language Translation Retrieval models based on the smoothing strategies are shown in Table 2.

TABLE 2. MULTI PARAMETER LANGUAGE MODEL SMOOTHING METHOD FUSION

10-feature   SP       MP       Gain(%)
NEWS_Jm      0.4336   0.4340   0.09
NEWS_Dir     0.4302   0.4330   0.65
NEWS_Dis     0.4360   0.4379   0.44
TD3_Jm       0.1176   0.1300   10.54
TD3_Dir      0.0668   0.1047   56.74
TD3_Dis      0.1230   0.1305   6.10
TD4_Jm       0.1346   0.1361   1.11
TD4_Dir      0.0853   0.1283   50.41
TD4_Dis      0.1415   0.1429   0.99

Comparing the experimental results in Table 2, it can be seen that the multi-parameter (MP) language model smoothing method is superior to the single-parameter (SP) language model smoothing method, especially on the TD 2003 data set. The SP method shows the general level of sorting. The results also illustrate that there is strong complementarity between the multi-parameter sorting features, which can greatly improve the sorting effect.

B. Performance Verification of Feature Weights Comprehensive Sorting in Different Domains of Pages

In this experiment, the classifier developed by the Beijing University network laboratory is taken as the basic classifier, and the traditional precision rate, recall rate and F value are adopted to evaluate the classification results. When a user makes a search request, the search system returns the relevant documents to the user. For such search behaviour, we can divide a document collection into four disjoint subsets according to two dimensions, as shown in Figure 3.

On the basis of dividing the document set into four subsets, we can quantitatively describe the precision rate, recall rate and F value. The following three formulas give the calculation of these three indexes.

$$precision = \frac{N}{N + M}$$

$$recall = \frac{N}{N + K}$$

$$F = \frac{2 \times precision \times recall}{precision + recall}$$

Figure 3. Understanding the two dimensions of document collection

In Figure 3:
1) N represents the documents that are in the results of this search and related to the search request.
2) M represents the documents that are in the results of this search but not related to the search request.
3) K represents the documents that are outside the results of this search but related to the search request.
4) L represents the documents that are outside the results of this search and not related to the search request.

We can use the above three formulas to calculate the three indexes for the different categories of documents in the data set. Figure 4 is a performance comparison between the old classifier and the new one, in which the horizontal axis represents the category numbers. Table 3 gives the class name corresponding to each category number in Figure 4.

Figure 4. Comparison of classification results before and after web page cleaning

TABLE 3. THE CHECK LIST OF CATEGORY NUMBERS

Category Number   Class Name
1                 Humanity
2                 News Media
3                 Business Economy
4                 Entertainment and Leisure
5                 IT
6                 Education
7                 Tourism
8                 Natural Science
9                 Government Politics
10                Social Science
11                Health Care
12                Social Culture

From Figure 4 we can see that the classification results of all categories are better than before. In addition, when the webpages in the training set and testing set were selected manually, pages with more text information and less noise information were chosen as far as possible. Therefore, the purification effect on web pages in practical applications is more obvious than in the results of this experiment.

VI. CONCLUSION

This paper optimizes the traditional Translation Retrieval model based on Network searching engine and finds an effective solution to the problems of data sparsity and of equal weights for different domains in the traditional model. The improved model can effectively improve the precision and recall rate of a search engine, which provides a method and a theoretical principle for the development of search engines.
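For reference, the three indexes defined in Section V.B can be computed from the subset sizes N, M and K as in the sketch below. The function name and the example counts are hypothetical, and returning 0 when a denominator is empty is an assumption the paper does not discuss.

```python
def precision_recall_f(n, m, k):
    """Compute precision, recall and F value from subset sizes.

    n: documents in the results and related to the request (N)
    m: documents in the results but not related (M)
    k: documents outside the results but related (K)
    Empty denominators yield 0.0 by assumption.
    """
    precision = n / (n + m) if (n + m) else 0.0
    recall = n / (n + k) if (n + k) else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value


# Hypothetical counts for one category of one classifier.
precision, recall, f_value = precision_recall_f(n=30, m=10, k=20)
```

The fourth subset, L, does not enter any of the three formulas, which is why the function takes only three counts.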