Semantic Inversion in XML Keyword Search with General Conditional Random Fields



Shu-Han Wang and Zhi-Hong Deng
Key Laboratory of Machine Perception (Ministry of Education), School of Electronic Engineering and Computer Science, Peking University

Abstract. Keyword search is widely used in information retrieval systems such as search engines. However, the input keywords are often so ambiguous that the retrieval intent can hardly be determined explicitly. It is therefore worthwhile to invert keywords into their semantics. In this paper, we define the Semantic Inversion problem in XML keyword search and solve it with general Conditional Random Fields. Our algorithm weighs different categories of relevance and provides alternative label sequences corresponding to the retrieval keywords. Experimental results show that our algorithm is effective and 12% higher than the baseline in terms of precision.

1 Introduction

Keyword search is widely used to retrieve information from all kinds of databases, such as document corpora, relational databases, and semi-structured databases. Its main weakness, however, is ambiguity: given a query that consists only of keywords, it is hard to know what the user really wants to find. For example, suppose a user wants to find a paper published at IJCAI, written by Pineau, about a point-based technique. A proper keyword query for this information need may be "Point based Pineau IJCAI". However, almost all existing work computes results from statistical information about the keywords without understanding the semantics behind them. In fact, the query would be far less ambiguous if the system knew that "Point based" is part of the paper's title (users can often hardly remember the whole title, so they type in only part of it), that "Pineau" is the author, and that "IJCAI" is the title of a proceedings volume.
As we see, if we can recognize the inner semantics of the user's input, the user's search intent becomes much more explicit. Recent work has started to help users construct semantic queries. The approach in [1] extracts structural semantics from graph data and provides the top-k matching subgraphs for a query. The work in [2] proposes an automatic keyword query reformulation approach that extracts information from the dataset offline and generates the semantics of the query online. The system in [3] guides users through a process of incrementally adding semantics to specify their query intention. Pandey and Punera analyze users' search intent by extracting the template structure of search queries [4].

X. Lin et al. (Eds.): WISE 2013, Part I, LNCS 8180. © Springer-Verlag Berlin Heidelberg 2013

These works show that deeper semantics with respect to keywords can greatly help retrieval. Different from all the methods above, however, our algorithm concentrates on tagging the keywords with labels from an XML database and gives users the top-k matching label sequences for a keyword sequence. Since XML labels carry clear semantics, users can confirm what they really need by selecting the proper labels. Our main contributions to keyword search on XML databases are as follows.

The Semantic Inversion for Keyword Search. Given a keyword sequence, our algorithm recognizes it as a label sequence, so as to understand the semantics of the keywords. We call this recognition Semantic Inversion. In our algorithm, alternative label sequences are provided after the keywords are typed in. Users select the labels that best match their keywords in order to clarify their retrieval intention. As semantics becomes increasingly important in retrieval, Semantic Inversion can be a promising way to optimize keyword search.

Model the Semantic Inversion with CRF. If the Semantic Inversion problem were hard to solve, it could hardly be useful for retrieval. Fortunately, we find that Semantic Inversion is similar to Part-of-Speech (POS) tagging and other sequential learning problems, so existing models may be useful. Conditional Random Fields have proved effective in sequential learning, and the results of our experiments also show that CRFs solve the Semantic Inversion problem well.

Quantize the Relevance by Weighing Diverse Features. Existing algorithms that compute relations in the keyword field usually concentrate on one or a few factors (the LCA of keywords, co-occurrence between keywords, etc.). In our algorithm, different categories of relevance (keyword-keyword, label-label, and keyword-label) are weighed jointly to quantify the relevance between a keyword sequence and a label sequence.
As we discuss later in this paper, our learning algorithm finds the parameters that best weigh the various features. In this paper, we discuss keyword search only in the XML domain.

The rest of the paper is organized as follows. Section 2 presents the definition of Semantic Inversion for keyword search. Section 3 introduces the general CRFs that we apply to our problem. Details of the features and the algorithm are provided in Section 4. Sections 5 and 6 present the experiments and related work. Finally, we conclude in Section 7.

2 Semantic Inversion

In this section, we introduce the concept of Semantic Inversion in the XML domain and explain why Semantic Inversion can improve keyword search.

Definition 1 (Label): Given a set of XML files and a word, a tag is a label of the word if the content of the tag contains the word. A single word may have many probable labels.

For instance, the label of the word "Pineau" is author, and the label of the words "Torran Dubh" (appearing together) can be conflict or caption. In XML files, the label can be regarded as the semantics of its content, so we do not distinguish between semantics and label in the rest of the paper.

Definition 2 (Label Sequence): Given a search keyword sequence S consisting of a sequence of words, the corresponding label sequence is composed of the labels of each word in the keyword sequence. A keyword sequence can be recognized as various probable label sequences, which express different semantics. For instance, the label sequence of the word sequence "Point, based, Pineau, IJCAI" can be "title, title, author, booktitle".

Definition 3 (Semantic Inversion): Given a set of XML files and a search keyword sequence $S = \{w_1, w_2, \ldots, w_k\}$, the problem of Semantic Inversion is to find the label sequence(s) $L = \{l_1, l_2, \ldots, l_k\}$ that maximize $Sim(S, L)$, where $Sim(S, L)$ is a function that evaluates the fitness or relevance of S and L. The answer sequences (label sequences) are given in descending order of $Sim(S, L)$. In our algorithm, the CRF model uses the conditional probability $\Pr(y \mid x)$ as the relevance function $Sim(S, L)$.

Semantic Inversion is useful in keyword search in three respects.

Semantic Inversion Can Help the Search Engine to Improve Accuracy. A traditional search engine may also recognize the word "Pineau" in the query "Point, based, Pineau, IJCAI" as a person's name, but after Semantic Inversion, "Pineau" can be recognized as an author's name, rather than a director's, a politician's, or someone else's. As a result, misunderstanding by the search engine is greatly reduced, and search accuracy improves.

Semantic Inversion Can Help Users Prevent Ambiguity. If the words in a query have several possible semantics, Semantic Inversion can provide alternative label sequences.
Since tags in XML are semantically meaningful and easy to understand, users can clarify their needs simply by selecting the proper label sequence.

Semantic Inversion Can Reduce the Search Time. Once the label sequence is selected, the search range is greatly narrowed. The search engine only needs to search the pages relevant to the labels, so the search time is greatly reduced.

3 General CRFs

CRFs (Conditional Random Fields) have been widely used in sequential algorithms; in sequential tagging problems in particular, CRFs outperform other models. This section briefly reviews CRFs and the general CRFs that we apply in our algorithm.

3.1 Conditional Random Fields

Assume $x = x_1, x_2, \ldots, x_n$ is the input keyword sequence and $y = y_1, y_2, \ldots, y_n$ is the label (semantic) sequence; x and y have the same length. A CRF (Conditional Random Field) [5] models the conditional probability $\Pr(y \mid x)$ by using a Markov random field for the structured y, and finds the y that maximizes $\Pr(y \mid x)$.

For the keyword sequence x and the semantic sequence y, the global feature vector of the CRF is the sum of all the local feature functions:

\[ F(y, x) = \sum_{i=1}^{n} f(y, x, i) \tag{1} \]

The CRF computes the conditional probability with parameter vector w by

\[ \Pr(y \mid x, w) = \frac{e^{w \cdot F(y, x)}}{Z_w(x)} \tag{2} \]

where

\[ Z_w(x) = \sum_{y'} e^{w \cdot F(y', x)} \tag{3} \]

For the given keyword sequence, the most probable label sequence maximizes the conditional probability. Since $Z_w(x)$ does not depend on y, we can also write:

\[ \hat{y} = \arg\max_{y} \; w \cdot F(y, x) \tag{4} \]

3.2 Learning Algorithm

Training a CRF means learning the parameter vector w that maximizes the log-likelihood of a given training set $T = \{(x_k, y_k)\}_{k=1}^{N}$. Meanwhile, we penalize the likelihood with a spherical Gaussian weight prior [6] to prevent overfitting. The gradient is therefore:

\[ \nabla L_w = \sum_{k} \left[ F(y_k, x_k) - E_{\Pr(y \mid x_k, w)} F(y, x_k) \right] - \frac{w}{\sigma^2} \tag{5} \]

The learning algorithm seeks the zero of the gradient, in other words, $\nabla L_w = 0$.

3.3 Linear-Chain CRFs and General CRFs

Linear-chain CRFs perform well in sequential learning problems such as NP chunking [7], Part-of-Speech tagging [5], Opinion Expression Identification [8], and Named Entity Recognition [9]. For this kind of problem, the Markov field of y is a linear chain, and the transition features are defined only between adjacent $y_i$. In those applications, we suppose the labels are sequential and use linear-chain CRFs to concentrate on the dependence between adjacent labels (or adjacent segments, as in Semi-Markov CRFs [10]).

The problems of retrieval, however, are quite different. Several labels appear together to express one subject jointly, and the labels are not sequential. We need to structure y so that it describes the relevance of all pairs of labels, rather than only the adjacent ones. This is the general CRF. In general CRFs, the structure of the labels can be a complete graph. Because of their complexity, general CRFs are not as commonly used as linear-chain CRFs.
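To make Equations (2)-(5) concrete, the following is a minimal brute-force sketch in Python. It is not the authors' implementation (which the paper says is written in C++): the feature function `toy_features`, the `LEXICON` table, the step size, and the value of `sigma2` are all hypothetical choices made for illustration. Because a general CRF has no chain structure to exploit, the sketch simply enumerates every label sequence, which is only feasible for very short queries.

```python
import itertools
import math

# Hypothetical lexicon of plausible (keyword, label) pairs; illustration only.
LEXICON = {("point", "title"), ("based", "title"), ("pineau", "author")}


def toy_features(y, x):
    """Hypothetical global feature vector F(y, x) with two components."""
    match = sum(1.0 for xi, yi in zip(x, y) if (xi, yi) in LEXICON)
    same = sum(1.0 for a, b in zip(y, y[1:]) if a == b)
    return [match, same]


def crf_distribution(x, labels, w, feats):
    """Return {y: Pr(y | x, w)} by enumerating all label sequences (Eqs. 2-3)."""
    scores = {}
    for y in itertools.product(labels, repeat=len(x)):
        scores[y] = sum(wi * fi for wi, fi in zip(w, feats(y, x)))
    top = max(scores.values())  # subtract the max score for numerical stability
    unnorm = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}


def gradient(data, labels, w, feats, sigma2=10.0):
    """Gradient of the penalized log-likelihood (Eq. 5)."""
    grad = [-wi / sigma2 for wi in w]
    for x, y_true in data:
        dist = crf_distribution(x, labels, w, feats)
        observed = feats(tuple(y_true), x)
        expected = [0.0] * len(w)
        for y, p in dist.items():
            for j, fj in enumerate(feats(y, x)):
                expected[j] += p * fj
        for j in range(len(w)):
            grad[j] += observed[j] - expected[j]
    return grad


labels = ["title", "author"]
data = [(("point", "based", "pineau"), ("title", "title", "author"))]
w = [0.0, 0.0]
for _ in range(200):  # fixed-step gradient ascent on the log-likelihood
    grad = gradient(data, labels, w, toy_features)
    w = [wi + 0.5 * gi for wi, gi in zip(w, grad)]

dist = crf_distribution(data[0][0], labels, w, toy_features)
y_hat = max(dist, key=dist.get)  # Eq. 4: the most probable label sequence
print(y_hat, round(dist[y_hat], 3))
```

The enumeration costs $|labels|^n$ evaluations per query, which illustrates why general CRFs are less commonly used than linear-chain CRFs.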

4 The Approach

Semantic Inversion can be naively solved by a random-select algorithm or a greedy algorithm. The random-select algorithm returns an answer selected at random from all candidate answers. The greedy algorithm recognizes each keyword $x_i$ as the label under which $x_i$ appears most frequently. However, both algorithms fail to consider the relevance between labels and the relevance between adjacent keywords. To take all categories of relevance into account, we employ general CRFs to model the relevance and propose an algorithm to solve Semantic Inversion. The algorithm weighs keyword-label relevance, label-label relevance, and adjacent-keyword relevance, and uses gradient descent to learn the best parameters for the model.

4.1 Features

In this section, we concentrate on the features used in the general CRF. The textual features that quantify the relevance of keywords and labels must be extracted beforehand. In our algorithm, there are three categories of features for a keyword sequence-label sequence pair (x, y): f for keyword-label relevance, g for label-label relevance, and h and h′ for the relevance between adjacent keywords.

Feature $f(x_i, y_i)$ expresses the dependence between the keyword and the label at position i. They appear at the same position, which indicates that we try to recognize the keyword $x_i$ as content of the label $y_i$. How frequently $x_i$ appears under the label $y_i$ should be our first consideration. We measure this kind of dependence as:

\[ f(x_i, y_i) = \frac{p(x_i, y_i)}{p(x_i)} \tag{6} \]

where $p(x_i, y_i)$ is the frequency with which keyword $x_i$ appears in the content of label $y_i$, and $p(x_i)$ denotes the frequency of keyword $x_i$. Since the frequencies of the labels can differ from one another, we do not consider this factor in feature f.

Feature $g(y_i, y_j)$ expresses the relevance of two labels. Existing methods (such as SLCA [11]) measure it mainly based on the tree-like structure of XML files.
We measure this relevance based on co-occurrence. The knowledge base can be seen as a set of instances (people, cities, films, etc.), and labels that often appear together to describe the same instances should be more closely related:

\[ g(y_i, y_j) = \frac{p(y_i, y_j)}{p(y_i)\, p(y_j)} \tag{7} \]

where $p(y_i, y_j)$ is the frequency with which label $y_i$ and label $y_j$ appear together in one instance, and $p(y_i)$ and $p(y_j)$ denote the frequencies of labels $y_i$ and $y_j$, respectively. In general CRFs, this transition feature is calculated between each pair of labels, which differs from linear-chain CRFs.

Features $h(x_i, x_{i+1}, y_i, y_{i+1})$ and $h'(x_i, x_{i+1}, y_i, y_{i+1})$ measure the relevance of adjacent keywords. Here we also use the co-occurrence of keywords to measure it:

\[ h_0(x_i, x_{i+1}) = \frac{p(x_i, x_{i+1})}{p(x_i)\, p(x_{i+1})} \tag{8} \]

where $p(x_i, x_{i+1})$ denotes the frequency with which keyword $x_i$ and keyword $x_{i+1}$ appear in the content of one label. Adjacent keywords are likely to express similar semantics. $h(x_i, x_{i+1}, y_i, y_{i+1})$ describes the contribution of recognizing adjacent keywords as the same label:

\[ h(x_i, x_{i+1}, y_i, y_{i+1}) = \begin{cases} h_0(x_i, x_{i+1}), & y_i = y_{i+1} \\ 0, & y_i \neq y_{i+1} \end{cases} \tag{9} \]

If an alternative label sequence inverts adjacent keywords into different labels, we also need to penalize it. $h'(x_i, x_{i+1}, y_i, y_{i+1})$ is the penalty according to the relevance of the keywords:

\[ h'(x_i, x_{i+1}, y_i, y_{i+1}) = \begin{cases} 0, & y_i = y_{i+1} \\ h_0(x_i, x_{i+1}), & y_i \neq y_{i+1} \end{cases} \tag{10} \]

All the features can be extracted quite easily. For Semantic Inversion, the joint information is contained in feature g, and the sequential information is captured in features h and h′. Feature f is the basic and most natural feature for our problem. We put these three sorts of features into the general CRF and then learn the parameters. The total weighted feature score is

\[ w \cdot F(y, x) = w_1 \sum_{i} f(x_i, y_i) + w_2 \sum_{i \neq j} g(y_i, y_j) + w_3 \sum_{i} h(x_i, x_{i+1}, y_i, y_{i+1}) + w_4 \sum_{i} h'(x_i, x_{i+1}, y_i, y_{i+1}) \tag{11} \]

where the parameter vector $w = [w_1, w_2, w_3, w_4]$ is what we need to learn with the general CRF. We can then use Equations (2) and (3) to calculate the probability of each alternative label sequence.

4.2 Parameter Learning

First, we find all the keywords and labels in the training data, and all probable labels of each keyword. The remaining steps are listed in Algorithm 1.

5 Experiments

In our experiments, the test algorithm gives the 10 most probable label sequences for each keyword sequence.
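The features of Section 4.1 and the weighted score of Equation (11) can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the three toy `instances`, the weight values, and the helper names are hypothetical. Two assumptions go beyond the text: frequencies are counted once per field (label content) or per instance, and a label is treated as co-occurring with itself in every instance that uses it, so that $g(y, y)$ is defined.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy knowledge base: each instance maps a label to its content.
instances = [
    {"title": "point based value iteration", "author": "pineau", "booktitle": "ijcai"},
    {"title": "point based planning", "author": "smith", "booktitle": "uai"},
    {"title": "pineau biography", "author": "jones"},
]

word_label = Counter()  # (word, label) -> fields where word occurs under label
word = Counter()        # word -> fields containing word
word_pair = Counter()   # {word, word} -> fields containing both words
label = Counter()       # label -> instances using label
label_pair = Counter()  # {label, label} -> instances using both labels
n_fields = 0

for inst in instances:
    for lab, content in inst.items():
        tokens = set(content.split())
        n_fields += 1
        for tok in tokens:
            word_label[(tok, lab)] += 1
            word[tok] += 1
        for a, b in combinations(sorted(tokens), 2):
            word_pair[frozenset((a, b))] += 1
    for lab in inst:
        label[lab] += 1
    for a, b in combinations(sorted(inst), 2):
        label_pair[frozenset((a, b))] += 1

n_inst = len(instances)


def f(x, y):
    """Eq. 6: share of the occurrences of keyword x that fall under label y."""
    return word_label[(x, y)] / word[x] if word[x] else 0.0


def g(y1, y2):
    """Eq. 7: label co-occurrence ratio (assumes a label co-occurs with itself)."""
    both = label[y1] if y1 == y2 else label_pair[frozenset((y1, y2))]
    denom = (label[y1] / n_inst) * (label[y2] / n_inst)
    return (both / n_inst) / denom if denom else 0.0


def h0(x1, x2):
    """Eq. 8: keyword co-occurrence ratio within one field."""
    both = word[x1] if x1 == x2 else word_pair[frozenset((x1, x2))]
    denom = (word[x1] / n_fields) * (word[x2] / n_fields)
    return (both / n_fields) / denom if denom else 0.0


def score(x, y, w):
    """Eq. 11: weighted sum of f, g, h (Eq. 9) and h' (Eq. 10)."""
    w1, w2, w3, w4 = w
    n = len(x)
    s_f = sum(f(x[i], y[i]) for i in range(n))
    s_g = sum(g(y[i], y[j]) for i in range(n) for j in range(n) if i != j)
    s_h = sum(h0(x[i], x[i + 1]) for i in range(n - 1) if y[i] == y[i + 1])
    s_hp = sum(h0(x[i], x[i + 1]) for i in range(n - 1) if y[i] != y[i + 1])
    return w1 * s_f + w2 * s_g + w3 * s_h + w4 * s_hp


x = ("point", "based", "pineau")
w = (1.0, 0.1, 1.0, -1.0)  # hypothetical weights; w4 < 0 makes h' act as a penalty
good = ("title", "title", "author")
bad = ("title", "author", "author")
print(score(x, good, w), score(x, bad, w))
```

In this toy data "point" and "based" always occur in the same title field, so splitting them across different labels triggers the h′ penalty and the sequence `bad` scores lower than `good`.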

Algorithm 1. The Learning Algorithm
Learn(TrainingData = {<x, y>})
  Find all the keywords and labels in TrainingData, and all probable labels for each keyword.
  Calculate the features f, g, h, h′.
  Initialize the CRF.
  repeat
    Calculate ∇L by Equation (5).
    Update w by ∇L: w = w + ∇L.
  until ‖∇L‖ < threshold
  return CRF
End Learn

5.1 Data Source and Extraction

We use the Wikipedia dataset for our experiments. The dataset contains over 1,000,000 XML documents covering all fields of knowledge. We randomly select 50,000 of these documents, without favoring any particular field. We extract the keyword sequence-label sequence pairs from the infobox of each XML file. The Wikipedia infobox is an ideal source for our experiments because of its neat attribute-content format. We randomly select attributes as the labels and extract part of the respective content as the keywords. For each label, at most 4 keywords are selected. The total length of the label sequence will not be more than […].

5.2 Evaluation

As discussed before, our algorithm learns from the training set $T = \{(x_1, y_1), (x_2, y_2), \ldots\}$, which consists of keyword sequence-label sequence pairs. The algorithm adjusts the four parameters ($w_1$, $w_2$, $w_3$, and $w_4$) and applies them to the test set. We evaluate how closely the answers (label sequences) given by our algorithm resemble the correct label sequence. Here the correct sequence is the sequence of labels whose content actually contains the keywords. The test set contains only the keyword sequences. For each keyword sequence, we generate all probable label sequences. Label sequences whose summed feature $\sum_i f(x_i, y_i)$ is too low (in other words, whose keywords appear too rarely under these labels) are removed. The algorithm then grades all remaining label sequences by computing the conditional probability $\Pr(y \mid x)$.
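The candidate generation, pruning, and ranking just described might look like the following Python sketch. Everything concrete in it is an assumption made for illustration: the values in `F_TABLE`, the threshold `min_f_sum`, and the stand-in scorer `f_only_score`, which uses feature f alone where the full model would use the weighted score of Equation (11). The probabilities are normalized over the candidates that survive pruning.

```python
import itertools
import math

# Hypothetical keyword-label relevance values f(x, y), standing in for Eq. 6.
F_TABLE = {
    "point": {"title": 0.9, "caption": 0.1},
    "pineau": {"author": 0.8, "title": 0.2},
    "ijcai": {"booktitle": 0.7, "title": 0.3},
}


def f_only_score(keywords, y):
    """Stand-in for w . F(y, x) that uses feature f alone."""
    return sum(F_TABLE[x][yi] for x, yi in zip(keywords, y))


def rank_label_sequences(keywords, score, min_f_sum, k=10):
    """Enumerate candidate label sequences, prune weak ones, return the top k."""
    per_keyword = [list(F_TABLE[x]) for x in keywords]
    kept = []
    for y in itertools.product(*per_keyword):
        f_sum = sum(F_TABLE[x][yi] for x, yi in zip(keywords, y))
        if f_sum >= min_f_sum:  # drop sequences with too little keyword support
            kept.append(y)
    if not kept:
        return []
    scores = [score(keywords, y) for y in kept]
    top_score = max(scores)
    weights = [math.exp(s - top_score) for s in scores]
    z = sum(weights)
    ranked = sorted(
        zip(kept, (wt / z for wt in weights)),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:k]


top3 = rank_label_sequences(["point", "pineau", "ijcai"], f_only_score,
                            min_f_sum=1.1, k=3)
for y, prob in top3:
    print(y, round(prob, 3))
```

Pruning before scoring keeps the enumeration small, which matters because the number of candidates grows as the product of the number of probable labels per keyword.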
For each x in the test set S, we concentrate on the best label sequence $\hat{y}$ (the one with the highest conditional probability) and compare it to the correct sequence y. The accuracy is calculated by:

\[ Acc = \frac{\sum_{x \in S} \mathrm{Match}(\hat{y}, y)}{\sum_{x \in S} \mathrm{Length}(x)} \tag{12} \]

and

\[ \mathrm{Match}(\hat{y}, y) = \sum_{i} \mathrm{eq}(\hat{y}, y, i) \tag{13} \]

where $\mathrm{eq}(\hat{y}, y, i)$ indicates whether $\hat{y}$ and $y$ agree at position i.

Another point of interest is the accuracy of the best sequence within the top-N sequences (in our experiments, N = 4, 7, 10). In practice, a search engine should provide users with several alternative results on the first page so that users can choose the best one for themselves. Since some input keyword sequences are inherently ambiguous, the Top-N accuracy can sometimes be more convincing. The algorithm sorts the label sequences by conditional probability, and the set TopN(x) contains the top-N sequences. The Top-N accuracy is calculated by:

\[ Acc_N = \frac{\sum_{x \in S} \max_{y' \in TopN(x)} \mathrm{Match}(y', y)}{\sum_{x \in S} \mathrm{Length}(x)} \tag{14} \]

On the other hand, we want to know how the algorithm ranks the truly correct sequence. If the algorithm is effective, the rank of the correct label sequence should be small. We therefore also evaluate the algorithm by how frequently the correct label sequence appears in the Top-N (N = 1, 4, 7, 10) answers (label sequences). This shows whether the correct sequence has been scored highly.

5.3 Results and Discussion

We use 10-fold cross-validation to evaluate the experiments. All the keyword sequence-label sequence pairs are split into 10 parts. Each time, 9 parts are used for learning and one part for testing. The code is written in C++. The programs are run on a server with 4 core processors and 16GB memory.

Accuracy of the First Answer. Our algorithm (73.2%) outperforms the baseline algorithms (38.1% for the Random Select Algorithm and 61.2% for the Greedy Algorithm), which confirms our assumption that fully considering the various categories of relevance can improve the quality of Semantic Inversion.

Accuracy of Top-N Answers. Figure (a) shows the accuracy of the Top-N answers, N = 1, 4, 7, 10. This accuracy is calculated by Equation (14). Users can select the best answer from the N answers our algorithm gives. The best accuracy of the Top-10 answers can be over 90%.
That is to say, users can lead the search engine to understand the precise semantics with no more than 10% extra work.

The Rank of the Correct Answer. Figure (b) shows how our algorithm ranks the correct answer. Nearly half of the correct answers are ranked first. In over 80% of cases, the correct sequence is ranked within the top 10 and will be shown on the first page of results. In that case, the only thing users need to do is select it.
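The evaluation measures of Section 5.2 can be sketched in Python as below. This is an illustrative sketch under stated assumptions: `ranked` holds, for each test query, the label sequences already sorted by conditional probability, `gold` holds the correct sequences, and both toy lists as well as the name `hit_rate` (for how often the correct sequence appears in the Top-N) are hypothetical.

```python
def match(y_hat, y):
    """Eq. 13: number of positions at which two label sequences agree."""
    return sum(1 for a, b in zip(y_hat, y) if a == b)


def accuracy(ranked, gold):
    """Eq. 12: matched positions of the first answer over total query length."""
    matched = sum(match(r[0], g) for r, g in zip(ranked, gold))
    return matched / sum(len(g) for g in gold)


def top_n_accuracy(ranked, gold, n):
    """Eq. 14: best match among the top-n answers over total query length."""
    matched = sum(max(match(y, g) for y in r[:n]) for r, g in zip(ranked, gold))
    return matched / sum(len(g) for g in gold)


def hit_rate(ranked, gold, n):
    """Share of queries whose correct sequence appears in the top-n answers."""
    hits = sum(1 for r, g in zip(ranked, gold) if tuple(g) in map(tuple, r[:n]))
    return hits / len(gold)


# Hypothetical ranked outputs for two test queries, best answer first.
ranked = [
    [("title", "author"), ("title", "title")],
    [("author", "author", "booktitle"), ("title", "author", "booktitle")],
]
gold = [("title", "author"), ("title", "author", "booktitle")]

print(accuracy(ranked, gold))           # 4 of 5 positions match
print(top_n_accuracy(ranked, gold, 2))  # the top-2 contains both gold sequences
print(hit_rate(ranked, gold, 1))
```

Note that Equations (12) and (14) normalize by the total number of keywords rather than by the number of queries, so longer queries carry more weight than shorter ones.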


Figure: (a) Accuracy of Top-N Answers; (b) The Rank of the Correct Answer.

6 Related Work

In this section, we present related work on the use of Conditional Random Fields and on algorithms for keyword search over structured databases.

Semi-Markov Conditional Random Fields [10] split the sequence into several segments. Their goal is to find the segmentation that maximizes the conditional probability defined by the CRF model. Semi-CRFs perform very well in NER problems [12]. For our problem, a Semi-CRF can easily find the phrases and names contained in the input sequence, but it does not fully describe the joint semantics of the labels, because Semi-CRFs are mainly based on linear-chain CRFs.

On structured data, there is work focusing on keyword query reformulation [2]. The reformulated queries provide alternative descriptions of the original input, so as to better capture users' information needs and guide users to explore related items in the target structured data. The data are modeled as a heterogeneous graph, and a probabilistic generation model is used for query reformulation. Its aim is to help users state the semantics clearly.

In the field of RDF, recent work provides QUICK [3], a system that helps users construct semantic queries in a given domain. QUICK works with the schema graph and the query templates. Users can conveniently express their search intent by incrementally selecting the semantics and the structure provided by QUICK. In addition, an online system with a user-friendly interface has been built on QUICK.

7 Conclusions and Future Work

In today's keyword search, a good search system should understand users' intent deeply. How to recognize and represent the semantics of keywords is becoming more and more important. In XML data, labels are naturally semantic, so recognizing keywords as XML labels is what we need to concentrate on. In this paper we define this process as Semantic Inversion and model it with general conditional random fields. Our experiments show that our algorithm can efficiently recognize the keywords as labels, and

the top-k label sequences also give users the chance to clarify their real search intent. In the future, we want to construct a semantic knowledge base and build a better XML retrieval system based on Semantic Inversion. We also hope to extend Semantic Inversion to other structured data, such as RDF. If semantics can be simply and accurately inverted into other explicit forms, keyword search will surely improve.

Acknowledgement. This work is partially supported by a project of the National Natural Science Foundation of China and by Project 2009AA01Z136 of the National High Technology Research and Development Program of China (863 Program).

References

1. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k exploration of query candidates for efficient keyword search on graph-shaped (RDF) data. In: International Conference on Data Engineering, ICDE 2009 (2009)
2. Yao, J., Cui, B., Hua, L., Huang, Y.: Keyword Query Reformulation on Structured Data. In: International Conference on Data Engineering, ICDE (2012)
3. Zenz, G., Zhou, X., Minack, E., Siberski, W., Nejdl, W.: From keywords to semantic queries - Incremental query construction on the semantic web. Journal of Web Semantics 7(3) (2009)
4. Pandey, S., Punera, K.: Unsupervised Extraction of Template Structure in Web Search Queries. In: International World Wide Web Conference, WWW (2012)
5. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: International Conference on Machine Learning, ICML 2001 (2001)
6. Chen, S.F., Rosenfeld, R.: A Gaussian Prior for Smoothing Maximum Entropy Models. Technical Report CMU-CS, Carnegie Mellon University (1999)
7. Sha, F., Pereira, F.C.N.: Shallow parsing with conditional random fields. In: North American Chapter of the Association for Computational Linguistics, NAACL (2003)
8. Breck, E., Choi, Y., Cardie, C.: Identifying expressions of opinion in context. In: International Joint Conference on Artificial Intelligence, IJCAI (2007)
9. McCallum, A., Li, W.: Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning, CoNLL (2003)
10. Sarawagi, S., Cohen, W.W.: Semi-Markov Conditional Random Fields for Information Extraction. In: Neural Information Processing Systems, NIPS (2004)
11. Xu, Y., Papakonstantinou, Y.: Efficient keyword search for smallest LCAs in XML databases. In: International Conference on Management of Data, SIGMOD (2005)
12. Okanohara, D., Miyao, Y., Tsuruoka, Y., Tsujii, J.: Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. In: Meeting of the Association for Computational Linguistics, ACL (2006)
