Improved Structured Robustness (I-SR): A Novel Approach to Predict Hard Keyword Queries


Journal of Scientific & Industrial Research, Vol. 76, January 2017, pp

M S Selvi, K Deepa, M S Sangari* and B Mohankumar
Department of IT, Sri Ramakrishna Engineering College, Coimbatore, India

Received 03 November 2015; revised 07 August 2016; accepted 10 October 2016

Keyword queries on databases provide easy access to data, but often suffer from low ranking quality, i.e., low precision and/or recall, as shown in recent benchmarks. It would be useful to identify queries that are likely to have low ranking quality in order to improve user satisfaction. For instance, the system may suggest alternative queries to the user for such hard queries. Existing work analyzes the characteristics of hard queries and proposes a novel framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. However, that system leaves a number of issues unaddressed. One of the main issues is that only the keywords submitted by the user are used when predicting results. The existing work does not consider the semantic meaning shared among the submitted keywords, which leads to inaccurate result retrieval. To overcome this problem, the proposed work introduces semantic keyword prediction using an ontology-based representation, in which the semantic meaning of the keywords is analyzed with the WordNet tool. Taking the semantic meaning into account leads to more accurate top-k retrieval of documents.

Keywords: Query Search, IR ranking, Structured Robustness, Ontological Annotation

Introduction
Applications in which plain text coexists with structured data are pervasive.
Furthermore, text and structured data are often stored side by side within standard relational database management systems (RDBMSs). Free-form keyword search over RDBMSs has led the way to implementing ontological search over multiple databases. Given a keyword query, systems such as DBXplorer and DISCOVER join tuples from multiple relations in the database to identify tuple trees that contain all the query keywords ("AND" semantics). All such tuple trees are answers to the query.

Keyword Query Interfaces (KQIs) for databases are used for searching and exploring this data. KQIs identify the information need behind each keyword query, and the results are ranked by how well they satisfy the user's intention. Relational databases are otherwise searched using Structured Query Language (SQL). Internet users cannot browse the complete answer set and expect to find what they need within the top 10 results.

*Author for Correspondence: senthamilselvi@srec.ac.in

Keyword search is an efficient mechanism for querying textual document systems and the World Wide Web. The database research community is playing a vital role in introducing keyword search techniques for relational databases, XML databases, graph databases and heterogeneous data sources, both structured and unstructured. Keyword search over structured and semi-structured data needs to extract data from different locations that are interconnected and collectively related to the query.

Given a database D of m objects, each described by n attributes, and a scoring function f according to which the objects of D are ranked, a top-k query Q efficiently returns the k objects with the highest scores under f. Instead of all answers, a top-k query returns only the subset of most relevant answers. This has driven the development of efficient information retrieval techniques for query search in databases.
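The top-k definition above can be illustrated with a short sketch. It assumes a toy in-memory database of movie objects and a hypothetical scoring function f that is a weighted sum of two numeric attributes; the names top_k, movies and f, as well as the weights, are illustrative and do not come from the paper.

```python
import heapq

def top_k(objects, f, k):
    """Return the k objects with the highest score under f, best first."""
    # heapq.nlargest avoids fully sorting all m objects when k is small.
    return heapq.nlargest(k, objects, key=f)

# Toy database D: m = 4 objects, each described by n = 2 numeric attributes.
movies = [
    {"title": "A", "rating": 8.0, "votes": 0.2},
    {"title": "B", "rating": 6.5, "votes": 0.9},
    {"title": "C", "rating": 9.1, "votes": 0.7},
    {"title": "D", "rating": 5.0, "votes": 0.1},
]

def f(obj):
    # Hypothetical scoring function: weighted sum of the two attributes.
    return 0.8 * obj["rating"] + 0.2 * 10 * obj["votes"]

best = top_k(movies, f, k=2)
print([m["title"] for m in best])
```

A heap-based selection is only one possible design; a real RDBMS would typically push the ranking into the query plan rather than score every object in application code.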
The notion of ranking robustness (Yun Zhou, 2006) refers to a property of a ranked list of documents that indicates how stable the ranking is in the presence of uncertainty in the ranked documents. The robustness score correlates significantly and consistently with query performance in a variety of TREC test collections, including the GOV2 collection. Query performance predictors (Claudia Hauff, 2010) are evaluated by reporting correlation coefficients that denote how well the methods predict the retrieval performance of a set of queries, but such evaluations do not address how strong the correlation needs to be. This issue can be examined in the context of two settings: Selective Query Expansion and Meta-Search. There, the quality of a predictor is studied to determine how the strength of the achieved correlation affects the effectiveness of an adaptive retrieval system. The results of that study show that many existing predictors fail to achieve a correlation strong enough to reliably improve retrieval effectiveness in either the Selective Query Expansion or the Meta-Search setting.

Keyword query interfaces (KQIs) (Shiwen Cheng, 2014) for databases provide easy access to data, but often suffer from low ranking quality, i.e., low precision and/or recall, as shown in recent benchmarks. Shiwen Cheng analyzed the characteristics of hard queries and proposed a novel framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. It is evaluated using prediction algorithms on keyword search over databases, with the INEX and SemSearch collections. State-of-the-art query performance prediction algorithms, both post-retrieval and pre-retrieval, have also been analyzed with respect to their sensitivity to the retrieval algorithm (C. Hauff); an improved version of the Clarity predictor was shown to outperform state-of-the-art predictors on three standard collections, including two large Web collections. Applications in which plain text coexists with structured data are pervasive.
Commercial relational database management systems (RDBMSs) generally provide querying capabilities for text attributes that incorporate state-of-the-art information retrieval (IR) relevance ranking strategies, but this search functionality requires that queries specify the exact column or columns against which a given list of keywords is to be matched (Vagelis Hristidis, 2003). IR-style document-relevance ranking strategies help to solve the problem of processing free-form keyword queries over RDBMSs by handling queries with both AND and OR semantics, and they exploit the sophisticated single-column text-search functionality often available in commercial RDBMSs.

The Ontology Based Approach for Domain Specific Semantic Information Retrieval System (Pratibha S, 2014) searches for the required document using a conceptual search technique, in which the system understands the meanings of the concepts, finds the relations between the concepts in the user's query, and then retrieves the semantic answer. This conceptual search technique is implemented using an ontology. Query expansion is performed by converting the input query into a SPARQL query. SPARQL is a query language for RDF. The SPARQL query is then run against the RDF database to access the relevant information.

Proposed Methodology
The proposed methodology focuses on the keyword query search process using an ontology with the WordNet tool. This paper proposes an Improved Structured Robustness (I-SR) algorithm that provides fast and efficient keyword query results based on the WordNet tool. When a query is given, its meaning is analyzed along with the related information for that query. If the user query includes special characters such as +, space and - to indicate conjunctive and exclusive conditions, those characters are retained and taken into account when determining the semantic meaning of the query.

Architecture of I-SR Algorithm
Fig. 1 Architecture of I-SR algorithm
Initially, the original dataset, which contains the result set for user queries, is stored in the corresponding database. Each time a user submits a query, it is run against the database to retrieve the corresponding answer. The proposed method additionally uses the WordNet tool to obtain a useful description of the particular query. The top-k result is obtained through the PRMS ranking method for semi-structured data. Whenever new keyword instances are identified, the ranking is updated by the PRMS ranking method. PRMS uses a language model approach to search over structured data. It computes the language model of each attribute value smoothed by the language model of its attribute.
It assigns each attribute a query keyword-specific weight, which specifies its contribution to the ranking score. The ranking of the search results is improved by a robustness-based ranking algorithm that increases the number of semantically relevant answers. The time to retrieve the search data on the original and the corrupted database is measured. The top-k results are then re-ranked for the specific user query. The Improved SR algorithm is used to reduce the overhead time of processing the query. The algorithm also measures the difficulty of a query and its rankings, based on the differences between the rankings of the same query over the original and noisy (corrupted) versions of the same database. The I-SR algorithm efficiently handles corrupted entities, attributes and attribute values using a robustness score. The I-SR score is therefore calculated for each user query and its results from the database. Finally, the result is obtained by computing the Spearman correlation.

Procedure
Noise Generation in Databases
In order to compute SR, we need to define the noise generation model f_XDB(M) for database DB. In this model, each attribute value is corrupted by a combination of three corruption levels: on the value itself, on its attribute, and on its entity set. The details are as follows. Since ranking methods for queries over structured data do not generally consider the terms in V that do not belong to query Q, we consider their frequencies to be the same across the original and noisy versions of DB. Given query Q, the model works with a vector that contains the term frequencies for the terms w ∈ Q ∩ V. Similarly, we simplify our model by assuming that the attribute values in DB and the terms in Q ∩ V are independent. The corruption model must reflect the challenges of search on structured data, so it captures the statistical properties of the query keywords in the attribute values, attributes and entity sets.
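The following is a minimal sketch of how the robustness idea described above could be computed; it is not the authors' implementation. It assumes (none of this is stated in the paper) that each result is a small dict of per-attribute term frequencies, that the corrupted frequency of a query term is drawn from a Poisson distribution whose mean mixes the value-level, attribute-level and entity-set-level frequencies with hypothetical weights, and that a toy term-frequency ranker stands in for the retrieval function g. The names spearman, poisson, corrupt, rank_by_tf and sr_score are illustrative.

```python
import math
import random
from statistics import mean

def spearman(rank_a, rank_b):
    """Spearman rank correlation of two orderings of the same items (no ties)."""
    n = len(rank_a)
    pos_b = {item: i for i, item in enumerate(rank_b)}
    d2 = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def poisson(lam, rng):
    """Draw a Poisson(lam) sample with Knuth's multiplication method."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def corrupt(results, query, rng, weights=(0.6, 0.3, 0.1)):
    """Return a noisy copy of results; only query-term frequencies change."""
    g_val, g_attr, g_set = weights  # hypothetical mixing weights

    def avg_tf(term, attr=None):
        # Mean frequency of term over one attribute, or over the entity set.
        vals = [r["fields"][a].get(term, 0)
                for r in results for a in r["fields"]
                if attr is None or a == attr]
        return mean(vals) if vals else 0.0

    noisy = []
    for r in results:
        fields = {}
        for attr, tf in r["fields"].items():
            new_tf = dict(tf)  # non-query terms keep their frequencies
            for w in query:
                lam = (g_val * tf.get(w, 0) + g_attr * avg_tf(w, attr)
                       + g_set * avg_tf(w))
                new_tf[w] = poisson(lam, rng)
            fields[attr] = new_tf
        noisy.append({"id": r["id"], "fields": fields})
    return noisy

def rank_by_tf(query, results):
    """Toy stand-in for g: order ids by total query-term frequency."""
    def score(r):
        return sum(tf.get(w, 0) for tf in r["fields"].values() for w in query)
    return [r["id"] for r in sorted(results, key=lambda r: (-score(r), r["id"]))]

def sr_score(query, results, g, n_iter=50, seed=0):
    """Average Spearman correlation between original and corrupted rankings."""
    rng = random.Random(seed)
    base = g(query, results)
    return mean(spearman(base, g(query, corrupt(results, query, rng)))
                for _ in range(n_iter))

docs = [
    {"id": "m1", "fields": {"title": {"the": 1, "godfather": 1}, "plot": {"godfather": 2}}},
    {"id": "m2", "fields": {"title": {"goodfellas": 1}, "plot": {"godfather": 1}}},
    {"id": "m3", "fields": {"title": {"heat": 1}, "plot": {"crime": 3}}},
]
print(sr_score(["godfather"], docs, rank_by_tf))
```

A score near 1 means the ranking barely changes under corruption, which this line of work treats as a sign of an easier query; a low score suggests a harder one. Only the top-k results are corrupted and re-ranked here, which keeps the cost per iteration small.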
The introduction of content noise to the attributes and entity sets propagates down to the attribute values. For instance, if an attribute value of attribute title contains the keyword Godfather, then Godfather may appear in any attribute value of attribute title in a corrupted database instance. Similarly, if Godfather appears in an attribute value of entity set movie, then Godfather may appear in any attribute value of entity set movie in a corrupted instance.

Ranking in Original Database
With the mapping probabilities estimated, the probabilistic retrieval model for semi-structured data (PRMS) can use them as weights for combining the score from each element into a document score, as follows: (1)
Here, the mapping probability P_M(E_j | w) is calculated and the element-level query-likelihood score P_QL(w | e_j) is estimated in the same way as in the HLM approach.
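As a rough illustration of the weighting just described: PRMS is commonly written as P(Q | d) = ∏_i Σ_j P_M(E_j | w_i) · P_QL(w_i | e_j), with P_M obtained by Bayes' rule from element-level language models. The sketch below follows that commonly published form under simplifying assumptions that are not taken from this paper: a uniform prior over elements, linear smoothing with a hypothetical weight lam, and whitespace tokenization. The function names and the two toy movie records are illustrative only.

```python
import math
from collections import Counter

def field_models(docs):
    """Collection-level term counts for each element (field)."""
    models = {}
    for d in docs:
        for field, text in d.items():
            models.setdefault(field, Counter()).update(text.lower().split())
    return models

def mapping_prob(w, models):
    """P_M(E_j | w) by Bayes' rule, assuming a uniform prior over elements."""
    like = {f: c[w] / sum(c.values()) for f, c in models.items() if sum(c.values())}
    total = sum(like.values())
    # Fall back to a uniform distribution for terms unseen in every element.
    return {f: (p / total if total else 1.0 / len(like)) for f, p in like.items()}

def prms_log_score(query, doc, models, lam=0.1):
    """log P(Q | d): per query term, a P_M-weighted sum of smoothed P_QL."""
    score = 0.0
    for w in query:
        p_w = 0.0
        for f, weight in mapping_prob(w, models).items():
            terms = doc.get(f, "").lower().split()
            p_doc = terms.count(w) / len(terms) if terms else 0.0
            p_coll = models[f][w] / sum(models[f].values())
            # Element language model smoothed by its collection-level model.
            p_w += weight * ((1 - lam) * p_doc + lam * p_coll)
        score += math.log(p_w) if p_w > 0 else float("-inf")
    return score

docs = [
    {"title": "sleepless in seattle", "genre": "romance comedy",
     "plot": "a widower finds love"},
    {"title": "the godfather", "genre": "crime drama",
     "plot": "a mafia family saga with little romance"},
]
models = field_models(docs)
print(mapping_prob("romance", models))
print([prms_log_score(["romance"], d, models) for d in docs])
```

In this toy collection the term romance is mapped mostly to the genre element, so the movie whose genre contains it scores higher than the one that only mentions it in the plot, while the second movie still receives a finite score and stays in the ranking.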

(2) (3)
The rationale behind this weighting is that the mapping probability is the result of the inference procedure that decides which element the user may have meant for a given query term. For instance, for the query term 'romance', this model assigns a higher weight when the term is found in the genre element, as we assume that the user is more likely to have meant a type of movie than a word found in the plot. One may imagine a case where the user meant 'meg ryan' to be words in the title and 'romance' to be in the plot. Given that our goal is to make the best guess with the minimal information supplied by the user, however, PRMS will not rank movies that match this interpretation as highly as those matching the more common meaning. Movies that do match this interpretation will, however, appear in the ranking rather than being rejected outright, which would be the case if we were generating structured queries. The experimental results, based on collections and queries taken from actual web services, support the claim that the common interpretation is usually correct.

Ranking in Corrupted Database
With the mapping probabilities estimated as described above, the probabilistic retrieval model for semi-structured data (PRMS) can use them as weights for combining the score from each element into a document score, using the same equation. Here, the mapping probability P_M(E_j | w) is calculated and the element-level query-likelihood score P_QL(w | e_j) is estimated in the same way as in the HLM approach. The rationale behind this weighting is that the mapping probability is the result of the inference procedure that decides which element the user may have meant for a given query term. For instance, for the query term 'romance', this model assigns a higher weight when the term is found in the genre element, as we assume that the user is more likely to have meant a type of movie than a word found in the plot.
One may imagine a case where the user meant 'meg ryan' to be words in the title and 'romance' to be in the plot. Given that our goal is to make the best guess with the minimal information supplied by the user, however, PRMS will not rank movies that match this interpretation as highly as those matching the more common meaning. Movies that do match this interpretation will, however, appear in the ranking rather than being rejected outright, which would be the case if we were generating structured queries. The experimental results, based on collections and queries taken from actual web services, support the claim that the common interpretation is usually correct.

Improved Structured Robustness (I-SR) Algorithm
We compute the similarity of the answer lists using the Spearman rank correlation. It ranges between -1 and 1, where 1, -1, and 0 indicate perfect positive correlation, perfect negative correlation, and almost no correlation, respectively. The Structured Robustness score (SR score) for query Q over database DB, given retrieval function g, is computed as:
$SR(Q, g, DB, XDB) = E\{Sim(L(Q, g, DB), L(Q, g, XDB))\}$ (4)
where L(Q, g, DB) is the ranked answer list of g for Q over DB, XDB is the noisy version of DB, and Sim denotes the Spearman rank correlation between the ranked answer lists.

Pseudocode: I-SR Algorithm
Input: query Q, inverted index I, number of relations in the ontology R, finite set O (ontology), similarity word, top-k result list L of Q by ranking function g, number of corruption iterations N.
ISR ← 0; C ← {};
For i = 1 to N do
    For each word position t in the query do
        WordInformation[t] = find word information for Words(t) using WordNet
        For j = 1 to R − 1 do    // R = number of relations in the ontology concept
            For each concept of the ontology do
                If Type(wordType.word) is a noun then
                    WordDistance = wordType.word.GetSimilarity(concept of ontology); return
    Build the similarity matrix.
    Improve the relevance score value based on the web page counts.
    Then do the same process for the corrupted database.
    For each result R in L do
        For each attribute value A in R do
            Obtain the corrupted version of A.
            For each keyword w in Q do
                Calculate the number of occurrences of w in the corrupted A.
    Perform this for all words in the given query.
    Read the character after +, - and hyphen using the I-SR algorithm.



    Update all the metadata values.
    Compute the ranking using function g and the correlation method.
    Obtain the ISR result together with the semantic result using the ontology concept.
Return ISR += semantic and similarity top-k result (L, )

Simulation and Results
The simulation is done using Java. The INEX 2010 data-centric database set is used, and the following parameters are analyzed.

Recall
Recall is calculated from the true positive (TP) and false negative (FN) predictions of the retrieval. It is the percentage of positive results that are returned and is also referred to as the True Positive Rate.
$Recall = \frac{TP}{TP + FN}$ (6)
We use the Improved Structured Robustness algorithm in the proposed scenario to produce highly accurate results. For a number of queries, it yields higher recall values than the existing scenario. The proposed system therefore performs better than the existing one.

Precision
Precision is calculated from the true positive (TP) and false positive (FP) predictions of the retrieval.
$Precision = \frac{TP}{TP + FP}$ (7)
Fig. 2 Precision parameter graph
From Fig. 2, we can conclude that the proposed scenario produces higher precision values than the existing scenario. Using the I-SR method, the proposed system achieves higher precision than the SR method of the existing system. Thus, the proposed system achieves better results than the existing system.

Time Parameter
The generated graph shows that the time taken is long in the existing scenario and short in the proposed scenario. Because we use an ontology with the WordNet tool, the necessary answer set is extracted quickly and efficiently, with relevant top-k results. With the proposed I-SR and the ontology concept, computation takes the least time in the proposed system. These results indicate that the proposed scenario is more efficient than the existing algorithms.
Fig. 3 Time Parameter Graph

Conclusions and Future Work
Existing work analyzes the characteristics of hard queries and proposes a novel framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. However, that system leaves a number of issues unaddressed: its search quality is lower than that of other systems, and its reliability is the lowest. To overcome these drawbacks, we propose an improved ranking algorithm that enhances the accuracy of the system. The proposed system improves the reliability
rate of the difficult query prediction system. Some query topics contain the characters "+" and "-" to indicate conjunctive and exclusive conditions. In the existing system, these conditions are not used and the keywords after the character "-" are removed. The proposed difficult query prediction system uses these operators to improve search quality. In other words, this work supports these operators for more effective results. Additionally, we proposed an ontology-based approach using the WordNet tool to improve the efficiency of top-k results. It provides not only similar words but also the semantic meaning of the given query words. The experimental results indicate that the proposed system is more effective than the existing system in terms of accuracy, result quality and processing time.

References
1 Cheng S, Termehchy A & Hristidis V, Efficient prediction of difficult keyword queries over databases, IEEE Transactions on Knowledge and Data Engineering, Vol 26, no 6 (2014).
2 Hristidis V, Gravano L & Papakonstantinou Y, Efficient IR-style keyword searches over relational databases, Proc 29th Int VLDB Conf, Berlin, Germany, (2003).
3 Ganti V, He Y & Xin D, Keyword++: A framework to improve keyword search over entity databases, Proc VLDB Endowment, Singapore, Vol 3, no 1-2, (2010).
4 Bhalotia G, Hulgeri A, Nakhe C, Chakrabarti S & Sudarshan S, Keyword searching and browsing in databases using BANKS, Proc 18th Int Conf ICDE, San Jose, CA, USA, (2002).
5 Zhou Y & Croft B, Ranking Robustness: A novel framework to predict query performance, Proc 15th Int Conf ACM, CIKM, Geneva, Switzerland, (2006).
6 Collins-Thompson K & Bennett P N, Predicting query performance via classification, Proc 32nd Int Conf ECIR, Milton Keynes, UK, (2010).
7 Shtok A, Kurland O & Carmel D, Predicting query performance by query-drift estimation, Proc 2nd Int Conf TIR, Heidelberg, Germany, (2009).
8 Zhao Y, Scholer F & Tsegay Y, Effective pre-retrieval query performance prediction using similarity and variability evidence, Proc 30th Int Conf ECIR, Berlin, Germany, (2008).
9 Hauff C, Murdock V & Baeza-Yates R, Improved query difficulty prediction for the web, Proc 17th Int Conf CIKM, Napa Valley, CA, USA, (2008).
10 Hauff C, Azzopardi L, Hiemstra D & Jong F, Query performance prediction: evaluation contrasted with effectiveness, Proc 32nd Int Conf ECIR, Milton Keynes, UK, (2010).
11 Vishal Jain & Prasad S VAV, Ontology based information retrieval model in semantic web: A review, Proc IJARCSSE, Vol 4, Issue No 8, (2014).
12 Tim Finin, James Mayfield, Anupam Joshi, Scott Cost R & Clay Fink, Information retrieval and the semantic web, Proc 11th Int Conf on Information and Knowledge Management (2002).
13 Antonio Sanfilippo, Stephen Tratz, Michelle Gregory, Alan Chappell, Paul Whitney, Christian Posse, Patrick Paulson, Bob Baddeley, Ryan Hohimer & Amanda White, Ontological Annotation with WordNet, Proc Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, (2005).
14 Pratibha S, Sonakneware & Karale SJ, Ontology based approach for domain specific semantic information retrieval system, International Journal of Engineering Research and Applications (2014).
