A Novel Framework for Predicting Hard Keyword Queries on Databases using Ontology Concepts


B. Mohankumar, Dr. P. Marikkannu, S. Suganya
Department of IT, Sri Ramakrishna Engineering College, Coimbatore, India

Abstract: Keyword queries on databases provide easy access to data, but they often suffer from low ranking quality, i.e., low precision and/or recall, as shown in recent benchmarks. It would be useful to identify queries that are likely to have low ranking quality in order to improve user satisfaction. For instance, the system may suggest alternative queries to the user for such hard queries. Existing work analyzes the characteristics of hard queries and proposes a framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and the content of the database and the query results. However, that system leaves a number of issues open. One of the main issues is that, at prediction time, only the keywords submitted by the user are used. The existing work does not consider the semantic meaning of the submitted keywords, which leads to inaccurate result retrieval. To overcome this problem, the proposed work introduces semantic keyword prediction based on an ontology representation, in which the semantic meaning of the keywords is analyzed using the WordNet tool. Because the semantic meaning of the documents is taken into account by the search engine, this is intended to yield more accurate top-k document retrieval.

Index Terms: Query Prediction, Query Optimization and Performance, Keyword Query

I. Introduction

Keyword query interfaces (KQIs) for databases have attracted much attention in the last decade due to their flexibility and ease of use in searching and exploring data.
Since any entity in a data set that contains the query keywords is a potential answer, keyword queries typically have many possible answers. KQIs must identify the information needs behind keyword queries and rank the answers so that the desired answers appear at the top of the list. Unless otherwise noted, we refer to a keyword query as a query in the remainder of this paper. Databases contain entities, and entities contain attributes that take attribute values. Some of the difficulties of answering a query are as follows. First, unlike queries in languages like SQL, users do not normally specify the desired schema element(s) for each query term. For instance, query Q1: Iron on the IMDB database does not specify whether the user is interested in movies whose title is Iron or in movies distributed by the Iron Company. Thus, a KQI must find the desired attributes associated with each term in the query. Second, the schema of the output is not specified, i.e., users do not give enough information to single out exactly their desired entities.

II. Related Works

Keyword++: A Framework to Improve Keyword Search Over Entity Databases [3]

This work treats the entity database as a single relation, or as a (materialized) view that involves joins over multiple base relations. Essentially, it assumes that each tuple in the relation describes an entity and its attributes, which is often the case in entity search tasks. Its main contributions are summarized as follows:

1. It proposes a general framework, built on a given baseline keyword search interface over an entity database, that maps keywords in a query to predicates or ordering clauses.
2. It develops techniques that measure the correlation between a keyword and a predicate (or an ordering clause) by analyzing two result sets, returned by the baseline search engine, of a differential query pair with respect to the keyword.

IJIRT INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH IN TECHNOLOGY 61

3. It improves the quality of keyword-to-predicate mapping by aggregating the measured correlations over multiple differential query pairs discovered from the query log.
4. It develops a system that efficiently and accurately translates an input keyword query to a SQL query. Mappings are materialized for keywords observed in the query log; for a given keyword query at run time, a query translation is derived from the materialized mappings.

The work addresses incompleteness and imprecision issues in the context of keyword search over entity databases. It maps query keywords to matching predicates or ordering clauses. The modified queries may be cast either as SQL queries to be processed by a typical database system or as keyword queries with additional predicates or ordering clauses to be processed by a typical IR engine. The work focuses primarily on translating keyword queries to SQL queries, although its authors note that the techniques can be easily adopted by IR systems.

Ontology Based Information Retrieval Model in Semantic Web: A Review [12]

This review examines ontology-based information retrieval models, which aim to return highly accurate results for a user query. Its stated contribution is to provide semantic results while significantly reducing the error rate. Among the research it surveys, one study describes the development of an ontology-based model for generating metadata for audio and for selecting audio information in a user-customized manner, and shows how the proposed ontology can be used to generate information selection requests in database queries. Another presents a basic method for mapping LSI concepts onto a given ontology (WordNet), used both for improving retrieval recall and for dimension reduction. Experimental results for this method were reported on a subset of the TREC collection consisting of Los Angeles Times articles. That research showed that mapping terms onto WordNet hypernyms improves recall, bringing in more relevant documents.
LSI filtration enhances recall even further and also produces a smaller index. The open question is whether an expensive method such as LSI should be used just for term filtration. A third approach, applying LSI to the generated hypernym-by-document matrix, has yet to be tested. To demonstrate the potential of a proposed model, another surveyed work built an experimental prototype that employs topical ontologies for indexing Web documents in terms of their semantics. The proposed model is an adaptation of the vector-based ranking model that takes advantage of an ontology-based knowledge representation. The authors of the review propose a framework showing that an ontology approach can help novice researchers apply semantic search techniques to improve current search capabilities.

Ontological Annotation with WordNet [14]

Ontological annotations identify real-world entities, together with the properties and relations that characterize the entities' attributes and roles in their textual context, with respect to a reference ontology. Adding these annotations to unstructured or semi-structured data is a basic requirement for making Semantic Web technologies work. For example, the availability of ontologically annotated documents is crucial in enabling the shift from keyword-based queries and navigation by predefined links to semantic-driven search and navigation behaviors that can be effectively handled by automatic agents in Semantic Web applications. Several formalizations of WordNet as an OWL ontology have been developed during the last few years, and a WordNet Task Force has been created within the W3C Semantic Web Best Practices and Deployment Working Group to support the deployment of WordNet and similarly structured lexica in RDF/OWL. One of the main problems with turning WordNet into an OWL ontology is the sheer number of resulting concept classes. WordNet 2.0 has some 130,000 synonym sets.
If each synonym set is formalized as a concept class, the resulting number of classes would be too large and therefore impractical for a real-world application. Moreover, it is not clear whether such a large number of lexical concept classes is needed for applications such as semantic-based search and navigation. While it is important to have as wide a lexical coverage as possible, this objective can be achieved simply by linking a large number of word senses (e.g., the 130,000 synonym sets in WordNet) to a more manageable number of concept classes. The authors propose a technique for developing an analytical platform that:

1. Provides a WordNet-based ontology offering a manageable and yet comprehensive set of concept classes.

2. Leverages the lexical richness of WordNet to give an extensive characterization of each concept class in terms of lexical instances.
3. Integrates a class recognition algorithm that automates the assignment of concept classes to words in naturally occurring text.

The resulting framework provides an ontological annotation platform that can be effectively integrated with intelligence analysis systems to facilitate evidence marshaling and sustain the creation and validation of inference models.

Ontology Based Approach for Domain Specific Semantic Information Retrieval System [13]

The World Wide Web holds a very large number of documents, and the number grows at a steady pace. Finding the one relevant document among them is a challenging task. Traditionally, keyword-based information retrieval is used to search for the required document. A keyword-based system returns too many results, most of which are irrelevant, and the user has difficulty identifying the one that is actually needed. To overcome this limitation, conceptual search is used. Conceptual search means searching by meaning instead of by keyword matching. The system understands the meanings of the concepts, finds the relations between the concepts that users specify in their queries, and then retrieves the semantic answer. As an example, consider the word "can", which may mean either a container for storing water or "to be able to". A keyword-based system retrieves data for both meanings, whereas an ontology-based system retrieves only the concept that belongs to the domain. Conceptual search is implemented using an ontology. In computer science and information technology, Gruber defines an ontology as an explicit specification of a conceptualization.
In other words, an ontology describes the relationships between different concepts, their properties, and their attributes. In a semantic search system, the ontology is used to retrieve results by the contextual meaning of the input query instead of by keyword matching. An ontology provides a knowledge-sharing framework that supports the representation and sharing of domain knowledge. An increasing number of ontologies are being developed, and their reuse and sharing offers several benefits. One important benefit is that significant time and effort can be saved by reusing existing ontologies instead of building new ones every time. Another advantage is that heterogeneous systems and resources can interoperate seamlessly by sharing common knowledge. In the system proposed in [13], the meaningful concept is extracted from the user's input query and used for query expansion, which converts the query into a more meaningful form. The input query is converted into a SPARQL query; SPARQL is a query language for RDF data. The SPARQL query is then run against the RDF database to access the relevant information.

III. Proposed Methodology

The main contributions of our proposed approach are query processing using an ontology with the WordNet tool and the Improved Structured Robustness (I-SR) Algorithm. The I-SR Algorithm is used to provide fast and efficient keyword query results based on the WordNet tool. Given a query, the system analyzes its meaning and, in the same way, determines the related information for the query. User queries sometimes include characters such as +, space, and - to indicate conjunctive and exclusive conditions. The existing approach ignores these characters and also discards the keywords that follow them. Our proposed approach keeps the word after a space or hyphen character and provides the semantic meaning of each word in the result set.
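The operator handling and semantic expansion described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `parse_query`, `expand_terms`, and `TOY_SYNSETS` are hypothetical names, and the small dictionary stands in for a real WordNet lookup.

```python
# Hypothetical stand-in for WordNet synsets; a real system would query WordNet.
TOY_SYNSETS = {
    "jump": ["hop", "leap", "spring"],
    "film": ["movie", "picture"],
}


def parse_query(query):
    """Split a keyword query into required (+), excluded (-) and optional terms.

    Assumption: an operator is a leading '+' or '-' on a whitespace-separated
    token; a hyphen inside a token (e.g. 'high-jump') is kept as part of the word.
    """
    parsed = {"required": [], "excluded": [], "optional": []}
    for token in query.lower().split():
        if len(token) > 1 and token[0] == "+":
            parsed["required"].append(token[1:])
        elif len(token) > 1 and token[0] == "-":
            parsed["excluded"].append(token[1:])
        else:
            parsed["optional"].append(token)
    return parsed


def expand_terms(terms, synsets=TOY_SYNSETS):
    """Map each term to itself plus any semantically similar words found."""
    return {term: [term] + synsets.get(term, []) for term in terms}


if __name__ == "__main__":
    parsed = parse_query("+jump -rope high-jump")
    print(parsed)
    print(expand_terms(parsed["required"]))
```

Keeping the operator on the term, instead of discarding the term, is what lets the later ranking step treat required and excluded keywords differently.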
This process improves the quality of the search query. WordNet is a large lexical database of a language. It contains nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. The Improved Structured Robustness Algorithm computes a score value based on the top-K result entities. An ontology introduces the vocabulary relevant to a domain, frequently holding names for classes and relationships, and specifies the intended meaning of that vocabulary. We use the ontology concept to handle complicated queries and cases where a large schema exists. In the data mining domain, an ontology contains

definitions of basic data mining entities (e.g., datatype, dataset, data mining task, data mining algorithm) and allows extensions with more complex data mining entities (e.g., constraints, data mining scenarios, and data mining experiments).

Examples:

1. For the query "jump", the existing approach finds only results such as jump, jumped, and jumping, either within a few sentences or as the keyword alone.
2. The proposed approach additionally looks up the meaning of "jump" using the ontology concept, so semantically similar words such as hop, leap, and spring are found as well.

III. System Architecture

[Fig. 1: System Architecture. Components shown: original database, query, WordNet tool, PRMS ranking, noise generation in corrupted database, Spearman rank correlation, ontology concepts, improved SR score, approximation algorithm, performance evaluation, semantics and top K.]

Fig. 1 shows the complete workflow of the proposed system and illustrates how the system architecture is built to detect hard queries with multi-level noise generation and to produce quality results for a given query.

IV. Experiments

1. Noise Generation in Databases

In order to compute the Structured Robustness (SR) score, we need to define the noise generation model $f_{XDB}(M)$ for database $DB$. In this model, each attribute value is corrupted by a combination of three corruption levels: on the value itself, on its attribute, and on its entity set. The details are as follows. Since the ranking methods for queries over structured data do not generally consider the terms in $V$ that do not belong to query $Q$, we consider their frequencies to be the same across the original and noisy versions of $DB$. Given query $Q$, let $x$ be a vector that contains term frequencies for terms $w \in Q \cap V$. Similarly, we simplify our model by assuming that the attribute values in $DB$ and the terms in $Q \cap V$ are independent.
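As a rough illustration of combining three corruption levels, the sketch below draws the corrupted frequency of one query term in one attribute value from a mixture over the value, attribute, and entity-set levels. The text does not specify the distributions or the mixing weights at this point, so the Poisson mixture, the default weights, and the names `sample_poisson` and `corrupt_frequency` are all assumptions made for the example.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson-distributed count with Knuth's method (fine for small means)."""
    if mean <= 0:
        return 0
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def corrupt_frequency(value_mean, attribute_mean, entity_set_mean,
                      weights=(0.6, 0.3, 0.1), rng=None):
    """Sample the frequency of a query term in a corrupted attribute value.

    Assumption: one of the three levels is picked according to `weights`, and
    the frequency is drawn from a Poisson distribution whose mean is the
    term's average frequency at that level.
    """
    if rng is None:
        rng = random.Random()
    level_mean = rng.choices(
        [value_mean, attribute_mean, entity_set_mean], weights=weights, k=1
    )[0]
    return sample_poisson(level_mean, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical averages for one term at the value, attribute and entity-set level.
    print([corrupt_frequency(2.0, 1.0, 0.5, rng=rng) for _ in range(10)])
```

Passing a seeded `random.Random` keeps the corruption reproducible, which is convenient when the same query is corrupted over several iterations.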
The corruption model must reflect the challenges of search on structured data, where it is important to capture the statistical properties of the query keywords in the attribute values, attributes, and entity sets. We therefore introduce content noise to the attributes and entity sets, which propagates down to the attribute values (recall that we do not corrupt the attributes or entity sets themselves, only the attribute values). For instance, if an attribute value of attribute title contains the keyword Godfather, then Godfather may appear in any attribute value of attribute title in a corrupted database instance. Similarly, if Godfather appears in an attribute value of entity set movie, then Godfather may appear in any attribute value of entity set movie in a corrupted instance.

2. Ranking in Original Database

Once the mapping probabilities have been estimated, the probabilistic retrieval model for semi-structured data (PRMS) can use them as weights for combining the score from each element into a document score, as follows:

$$P(Q \mid d) = \prod_{i=1}^{m} \sum_{j=1}^{n} P_M(E_j \mid w_i) \, P_{QL}(w_i \mid e_j)$$

Here, the mapping probability $P_M(E_j \mid w)$ is calculated and the element-level query-likelihood score $P_{QL}(w \mid e_j)$ is estimated in the same way as in the HLM approach. The rationale behind this weighting is that the mapping probability is the result of the inference procedure that decides which element the user may have meant for a given query term. For instance, for the query term "romance", this model assigns a higher weight when the term is found in the genre element, since we assume that the user is more likely to have meant a type of movie than a word found in the plot. One may imagine a case where the user meant "meg ryan" to be words in the title and "romance" to be in the plot. Given that our goal is to make the best guess with the minimal information supplied by the user, however, PRMS will not rank movies that match this interpretation as highly as those matching the more common meaning. Movies that do match this interpretation will still appear in the ranking rather than being rejected outright, which would be the case if we were generating structured queries. The experimental results, based on collections and queries taken from actual web services, support the claim that the common interpretation is usually correct.

3. Ranking in Corrupted Database

The corrupted database is ranked with the same PRMS equation: the mapping probabilities are used as weights for combining the score from each element into a document score, with $P_M(E_j \mid w)$ and $P_{QL}(w \mid e_j)$ computed in the same way as for the original database.
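The PRMS weighting can be illustrated with a short sketch. It assumes the score takes the usual PRMS form, a product over query terms of a sum over elements of $P_M(E_j \mid w) \cdot P_{QL}(w \mid e_j)$. The function name `prms_score`, the dictionary layout, and all probability values are hypothetical and chosen only to mirror the "romance" example.

```python
from math import prod


def prms_score(query_terms, mapping_prob, element_likelihood):
    """Score one document as prod over terms of sum over elements of P_M * P_QL.

    mapping_prob[w][e]       : P_M(E_e | w), how likely term w targets element e
    element_likelihood[e][w] : P_QL(w | e), query likelihood of w in element e
                               of this document (already smoothed, by assumption)
    """
    return prod(
        sum(
            mapping_prob.get(w, {}).get(e, 0.0) * element_likelihood[e].get(w, 0.0)
            for e in element_likelihood
        )
        for w in query_terms
    )


if __name__ == "__main__":
    # Hypothetical numbers: "romance" is far more likely to refer to the genre.
    mapping = {"romance": {"genre": 0.8, "plot": 0.2}}
    movie_genre_match = {"genre": {"romance": 0.5}, "plot": {}}
    movie_plot_match = {"genre": {}, "plot": {"romance": 0.5}}
    print(prms_score(["romance"], mapping, movie_genre_match))  # genre match
    print(prms_score(["romance"], mapping, movie_plot_match))   # plot match
```

With these made-up numbers the genre match scores higher, while the plot match keeps a non-zero score, so it is ranked lower instead of being rejected outright.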
4. Improved Structured Robustness Algorithm

We compute the similarity of the answer lists using the Spearman rank correlation. It ranges between $-1$ and $1$, where $1$, $-1$, and $0$ indicate perfect positive correlation, perfect negative correlation, and almost no correlation, respectively. The Structured Robustness score (SR score) for query $Q$ over database $DB$, given retrieval function $g$, is computed as

$$SR(Q, g, DB, XDB) = \mathbb{E}\{Sim(L(Q, g, DB), L(Q, g, XDB))\}$$

where $Sim$ denotes the Spearman rank correlation between the ranked answer lists and $XDB$ is the corrupted version of $DB$.

Algorithm:

1. Input: query Q, inverted index I, number of relations R in the ontology, finite set O (ontology), similarity words, top-K result list L of Q by ranking function g, number of corruption iterations N.
2. ISR ← 0; C ← {};
3. For i = 1 → N do
4. For (int i = 0; i < word length; i++)
5. WordInformation[i] = find word information for Words(i) using WordNet
6. For (j = 1; j < R; j++)
7. R = number of relations that exist in the ontology concept
8. For each concept of the ontology

9. If Type(wordType.word) is a noun then
10. wordDistance = wordType.word.getSimilarity(concept of ontology); return
11. Build the similarity matrix.
12. Improve the relevance score value based on the web page counts.
13. Repeat the same process for the corrupted database.
14. For each result R in L do
15. For each attribute value A in R do
16. Obtain the corrupted version of A.
17. For each keyword w in Q do
18. Count the occurrences of w in the corrupted A.
19. Repeat for all words in the given query.
20. Read the characters after +, -, and hyphen using the I-SR algorithm.
21. Update all the metadata values.
22. Compute the ranking using function g and the correlation method.
23. Obtain the ISR result together with the semantic result using the ontology concept.
24. Return ISR += semantic and similarity top-K result (L, )

The listing shows the Improved Structured Robustness Algorithm (I-SR Algorithm), which computes the exact I-SR score based on the top-K result entities. We use the ontology-based WordNet tool to improve search efficiency and result quality. Each ranking algorithm uses some statistics about query terms or attribute values over the whole content of the DB. Examples of such statistics are the number of occurrences of a query term in all attribute values of the DB, or the total number of attribute values in each attribute and entity set. These global statistics are stored in M (metadata) and I (inverted indexes) in the SR Algorithm pseudocode. The SR Algorithm generates the noise in the DB on the fly during query processing. Since it corrupts only the top-K entities, which are returned by the ranking module anyway, it does not perform any extra I/O access to the DB, except to look up some statistics. Moreover, it uses information that is already computed and stored in the inverted indexes and does not require any extra index. The I-SR Algorithm also reads the words that follow a hyphen and provides similar words and semantic meanings for these strings.
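The similarity measure at the core of the SR score can be sketched in a few lines. The example assumes that the original and corrupted answer lists contain the same top-K entities without ties, and that the expectation is approximated by averaging over the corruption iterations; `spearman_rank_correlation` and `sr_score` are illustrative names, not part of the algorithm listing.

```python
def spearman_rank_correlation(original, corrupted):
    """Spearman's rho between two rankings of the same items (no ties)."""
    n = len(original)
    if n < 2 or set(original) != set(corrupted):
        raise ValueError("rankings must contain the same items, at least two")
    position = {item: rank for rank, item in enumerate(corrupted)}
    squared_diff = sum(
        (rank - position[item]) ** 2 for rank, item in enumerate(original)
    )
    return 1 - 6 * squared_diff / (n * (n * n - 1))


def sr_score(original, corrupted_rankings):
    """Average correlation between the original list and each corrupted list."""
    return sum(
        spearman_rank_correlation(original, ranking) for ranking in corrupted_rankings
    ) / len(corrupted_rankings)


if __name__ == "__main__":
    top_k = ["e1", "e2", "e3", "e4"]
    # Hypothetical re-rankings obtained from two corruption iterations.
    corrupted_runs = [["e1", "e2", "e4", "e3"], ["e2", "e1", "e3", "e4"]]
    print(sr_score(top_k, corrupted_runs))
```

A score near 1 means the ranking barely changes under corruption, which suggests an easy query; lower values indicate that the ranking is unstable and the query is likely hard.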
Hence, the ontology concept and the I-SR Algorithm increase the robustness of the query and the query result. Finally, we obtain the similar results and the semantic meaning of the corresponding query based on the ranking method.

5. Approximation Algorithms

In this section, we propose approximation algorithms to improve the efficiency of the SR Algorithm. Our methods are independent of the underlying ranking algorithm.

Query-specific Attribute values Only Approximation (QAO-Approx): QAO-Approx corrupts only the attribute values that match at least one query term, which significantly decreases the time spent on corruption. We add a check before Line 7 in the SR Algorithm to test whether A contains any term in Q, and skip the loop in Line 7 if it does not. The second and third levels of corruption (on attributes and entity sets, respectively) then corrupt a smaller number of attribute values, so the time spent on corruption becomes shorter.

Static Global Stats Approximation (SGS-Approx): SGS-Approx uses the following observation: given that only the top-K result entities are corrupted, the global DB statistics do not change much. In the SR Algorithm, once we get the ranked list of top-K entities for Q, the corruption module produces corrupted entities and updates the global statistics of the DB; the SR Algorithm then passes the corrupted results and the updated global statistics to the ranking module to compute the corrupted ranking list. SGS-Approx keeps the global statistics static instead, reusing the statistics of the original DB when re-ranking the corrupted results.

Combination of QAO-Approx and SGS-Approx: QAO-Approx and SGS-Approx improve the efficiency of the robustness calculation by approximating different parts of the corruption and re-ranking process. Hence, we combine these two algorithms to further improve the efficiency of query difficulty prediction.

V. Results

Recall

Recall is calculated from the retrieved results using the true positive and false negative counts.
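For reference, recall and precision as used in this section can be computed from a retrieved set and a relevant set with the standard definitions. The sketch below is only an illustration: the function name `precision_recall` and the document identifiers are made up, and returning 0.0 for an empty denominator is a convention chosen for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    TP: retrieved and relevant; FP: retrieved but not relevant;
    FN: relevant but not retrieved.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical result list and relevance judgments for a single query.
    print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]))
```

In this made-up case two of the four retrieved documents are relevant and one relevant document is missed, so precision is 2/4 and recall is 2/3.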
Recall in this context is also referred to as the True Positive Rate. Recall is the fraction of relevant instances that are retrieved:

$$\text{Recall} = \frac{TP}{TP + FN}$$

TP (True positive): if the outcome of a prediction is p and the actual value is also p, it is called a true positive (TP).
TN (True negative): a true negative (TN) occurs when both the prediction outcome and the actual value are n.
FP (False positive): if the outcome of a prediction is p and the actual value is n, it is said to be a false positive (FP).
FN (False negative): a false negative (FN) occurs when the prediction outcome is n while the actual value is p.

The recall graph plots the number of queries on the x-axis and the recall values on the y-axis. It shows that the proposed approach, which uses the Improved Structured Robustness Algorithm to produce highly accurate results, achieves higher recall values than the existing approach across the numbers of queries tested. On this measure, the proposed system performs better than the existing one.

Precision

Precision is calculated from the retrieved results using the true positive and false positive counts. It is the percentage of returned positive results that are relevant:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The precision graph plots the number of queries on the x-axis and the precision values on the y-axis. It shows that the proposed system, using the I-SR method, achieves higher precision values than the existing system, which uses the SR method, across the numbers of queries tested. On this measure as well, the proposed system performs better than the existing one.

Time Complexity

The time graph shows that computation time is high for the existing approach and low for the proposed approach.
Because we use an ontology with the WordNet tool, the necessary answer set is extracted quickly and efficiently with relevant top-K results. The graph plots the methodologies on the x-axis and the total time on the y-axis. With the proposed I-SR Algorithm and the ontology concept, the proposed system takes the least computation time. We conclude that the proposed approach outperforms the existing approach.

VI. Conclusion

Existing work analyzes the characteristics of hard queries and proposes a framework to measure the degree of difficulty of a keyword query over a database, considering both the structure and

the content of the database and the query results. However, that system leaves a number of issues open: its search quality is lower than that of other systems, and its reliability is low. To overcome these drawbacks, we propose an improved ranking algorithm that enhances the accuracy of the system and the reliability of difficult query prediction. Some query topics contain the characters "+" and "-" to indicate conjunctive and exclusive conditions. The existing system does not use these conditions and removes the keywords after the "-" character. Our proposed difficult query prediction system supports these operators to improve search quality. Additionally, we propose an ontology-based WordNet tool to improve the efficiency of top-K results; it provides not only similar words but also the semantic meaning of the words in the given query. The experimental results indicate that the proposed system is more effective than the existing system in terms of accuracy, result quality, and processing time.

VII. References

[1] V. Hristidis, L. Gravano, and Y. Papakonstantinou, "Efficient IR-style keyword search over relational databases," in Proc. 29th VLDB Conf., Berlin, Germany, 2003.
[2] Y. Luo, X. Lin, W. Wang, and X. Zhou, "SPARK: Top-k keyword query in relational databases," in Proc. ACM SIGMOD, Beijing, China.
[3] V. Ganti, Y. He, and D. Xin, "Keyword++: A framework to improve keyword search over entity databases," in Proc. VLDB Endowment, Singapore, Sept. 2010, vol. 3, no. 1-2.
[4] J. Kim, X. Xue, and B. Croft, "A probabilistic retrieval model for semistructured data," in Proc. ECIR, Toulouse, France, 2009.
[5] N. Sarkas, S. Paparizos, and P. Tsaparas, "Structured annotations of web queries," in Proc. ACM SIGMOD Int. Conf. Manage. Data, Indianapolis, IN, USA.
[6] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan, "Keyword searching and browsing in databases using BANKS," in Proc. 18th ICDE, San Jose, CA, USA, 2002.
[7] Y. Zhou and B. Croft, "Ranking robustness: A novel framework to predict query performance," in Proc. 15th ACM Int. CIKM, Geneva, Switzerland, 2006.
[8] B. He and I. Ounis, "Query performance prediction," Inf. Syst., vol. 31, no. 7, Nov.
[9] K. Collins-Thompson and P. N. Bennett, "Predicting query performance via classification," in Proc. 32nd ECIR, Milton Keynes, U.K., 2010.
[10] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Signal Process., vol. 35, no. 3, Mar.
[11] C. Hauff, L. Azzopardi, D. Hiemstra, and F. Jong, "Query performance prediction: Evaluation contrasted with effectiveness," in Proc. 32nd ECIR, Milton Keynes, U.K., 2010.
[12] Vishal Jain and S. V. A. V. Prasad, "Ontology based information retrieval model in Semantic Web: A review," IJARCSSE, vol. 4, no. 8, Aug.
[13] Tim Finin, James Mayfield, Anupam Joshi, R. Scott Cost, and Clay Fink, "Information retrieval and the Semantic Web," in Proc. 11th International Conference on Information and Knowledge Management, 2002, ACM.
[14] Antonio Sanfilippo, Stephen Tratz, Michelle Gregory, Alan Chappell, Paul Whitney, Christian Posse, Patrick Paulson, Bob Baddeley, Ryan Hohimer, and Amanda White, "Ontological annotation with WordNet," Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland.
