A Bayesian Approach to WSD for the Retrieval of XML Documents


Marco Mesiti (1), Paolo Rosso (2), Marina Merlo (1)
(1) Dipartimento di Informatica e Scienze dell'Informazione, Università di Genova, Italy, {mesiti,mmerlo}@disi.unige.it
(2) Dpto. de Sistemas Informáticos y Computación, Univ. Politécnica de Valencia, Spain, prosso@dsic.upv.es

Abstract

Sources of XML documents are proliferating on the World Wide Web. An important feature of XML is that information on document structure is available on the Web together with document content. This information can be exploited to improve document handling and query processing. In an environment as heterogeneous as the Web, it is not reasonable to assume that there are always XML documents which exactly satisfy a certain query. A metric for quantifying the structural similarity between an XML document and a query is therefore necessary. The aim is to develop a technique that allows for approximate querying, that is, querying based on structural similarity and on synonymy between tags of XML documents. In this paper, we present an algorithm for the retrieval of XML documents which is based on the structural and semantic similarity of a document with a given query. For the semantic indexing of the tags of XML documents and queries, the naive Bayesian approach and the WordNet ontology were used.

1 Introduction

XML (eXtensible Markup Language) [14] is a markup language that has recently emerged as the most relevant standardisation effort for document representation and exchange on the Web. XML is a simplified version of SGML, designed for use on the Web. Its goal is to provide the same advantages as SGML, while at the same time providing a language for creating documents on the Web which is easier to learn and use than SGML. HTML, created from SGML for use on the Web, is a helpful tool for presenting documents, but it is not adequate for exchanging them.
The main advantages of XML with respect to HTML are the possibility of defining tags, nested document structures, and document types (called DTDs: Document Type Definitions) that describe the structure of documents. The building blocks of XML are nested, tagged elements. Each tagged element has a sequence of zero or more attribute-value pairs, and a sequence of zero or more subelements. These subelements may themselves be tagged elements, or they may be "tagless" segments of natural language text data. A well-formed XML document is a document that follows the grammar rules of XML [14], which place no restriction on tags, attribute names, or nesting patterns. Alternatively, a document can be coupled with a DTD, which is essentially a grammar for restricting the tags and the structure of a document. An XML document conforming to a DTD is considered valid.

The need to shift from exact queries with boolean answers to approximate queries with ranked results has emerged as a requirement of XML query languages for searching the Web [7], and some approaches in this direction have been developed [3]. These approaches, however, consider similarity only between terms appearing in the documents (i.e., similarity of content), disregarding structural similarity. In this paper we propose an algorithm for matching an XML document against a DTD query with respect to its structure and its tag similarity. In fact, XML documents, because of their semantic tags, are self-describing. The aim is to exploit structural similarity in the development of XML-based search engines able to extract, from the Web, documents (or portions of documents) which are also semantically similar to a given query.

The remainder of the paper is structured as follows. Section 2 presents the matching algorithm based on structural similarity in the field of document classification. Section 3 deals with our tree representation of XML documents and queries, whereas Section 4 discusses our Bayesian approach for the disambiguation of queries and documents. Section 5 presents the semantic version of the matching algorithm, which labels the tags of DTD queries and XML documents with the concepts of the WordNet ontology. Section 6 reports some preliminary results on the retrieval of the XML documents which match a given query with respect to structure and synonymy between tags. Finally, Section 7 concludes the work by discussing extensions and applicability of the proposed technique.

2 From Classification to Retrieval of XML Documents

The approach we propose for querying XML documents relies on previous work carried out for XML document classification and for the disambiguation of natural language queries. In this section we present the main features of the classification approach, whereas in the following sections we discuss how to take semantics into account.

In [2] an approach for the classification of XML documents against a set of DTDs was proposed. This approach relies on a similarity measure that evaluates the degree of similarity between an XML document and a DTD by counting the exceeding and missing elements of the document with respect to the DTD (named, respectively, plus and minus elements). These elements are then weighted according to the level of the hierarchical structure at which they are detected and to the complexity of their structures. Starting from a tree representation of an XML document D and a DTD T, the similarity measure is computed by means of the matching algorithm Match, which produces a triple (p, m, c) for the pair (D, T). The triple evaluates the plus, minus, and common elements of the two structures, taking into account the weight associated with the plus and minus elements.
This algorithm is based on the idea of locally determining the best structure for a DTD element as soon as the information on the structure of its subelements in the document is known. The Match algorithm recursively visits the document and the DTD at the same time, from the root to the leaves, to match common elements. Specifically, two distinct phases can be distinguished:

1. in the first phase, moving down in the trees from the roots, the parts of the trees to visit through recursive calls are determined, but no evaluation is performed;
2. when a terminal case is reached, on return from the recursive calls and going up in the trees, the various alternatives are evaluated and the best one is selected.

Intuitively, in the first phase the DTD is used as a "guide" to detect the elements of the document that are covered by the DTD, disregarding the operators that bind together the subelements of an element. In the second phase, by contrast, the operators used in the DTD are considered, to verify that elements are bound as prescribed by the DTD and to define an evaluation of the missing or exceeding parts of the document with respect to the DTD.

The algorithm matches a document against a query and obtains a structural similarity value given by the function E, which is defined as follows:

\[ E((p, m, c)) = \frac{c}{\alpha \cdot p + c + \beta \cdot m} \tag{1} \]

The function is based on two real parameters $\alpha$ and $\beta$, s.t. $\alpha, \beta \geq 0$. Depending on the values assigned to these parameters, the function gives more relevance to plus elements with respect to minus elements, or vice versa. For example, if $\alpha = 0$ and $\beta = 1$, the function does not take plus elements into account in measuring similarity. Therefore, a document with only extra elements with respect to the ones specified in the DTD has a similarity measure equal to 1. By contrast, if $\alpha = 1$ and $\beta = 0$, the evaluation function does not take minus elements into account in the similarity measure. We assume that $\alpha = \beta = 1$, thus giving the same relevance to plus and minus elements.
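As an illustration, the evaluation function E can be sketched in a few lines of Python. The function name, the handling of an empty triple and the sample triples are assumptions made for this sketch; the weighting of plus and minus elements performed by Match is not reproduced here.

```python
def evaluate(p: float, m: float, c: float,
             alpha: float = 1.0, beta: float = 1.0) -> float:
    """Similarity E((p, m, c)) = c / (alpha * p + c + beta * m)."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    denominator = alpha * p + c + beta * m
    # Assumption of this sketch: a triple with no elements at all scores 0.
    return c / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical triple: 1 plus element, 2 minus elements, 7 common elements.
    print(evaluate(1, 2, 7))                    # alpha = beta = 1
    # With alpha = 0 the plus elements are ignored, so a document that only
    # has extra elements is judged identical to the DTD.
    print(evaluate(3, 0, 7, alpha=0, beta=1))
```

With the default setting, plus and minus elements lower the similarity equally; setting one of the two parameters to 0 makes the measure blind to that kind of mismatch.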
An exhaustive description of the matching algorithm goes beyond the scope of this paper. For the detailed version see [8].

3 Documents and Queries as Trees

In the previous section we already mentioned the use of a tree representation for documents and queries. For the sake of clarity, we consider simple XML documents, that is, documents formed by nested elements, disregarding attributes and the order of elements (i.e., we focus on data-centric documents). Moreover, we consider queries defined for the retrieval of the entire document, disregarding queries that can return parts of documents.

(a)
<movie>
  <title>La vita e bella</title>
  <cast>
    <actor>R. Benigni</actor>
    <actor>N. Braschi</actor>
    <actor>M. Paredes</actor>
  </cast>
  <story>... NL text ...</story>
  <production>
    <producer>G. Braschi</producer>
    <producer>E. Ferri</producer>
  </production>
  <director>R. Benigni</director>
</movie>

(b) Labelled tree of the document in (a): the root movie has the children title, cast, story, production and director; cast has three actor children and production has two producer children; each leaf carries the value of its element.

Figure 1: XML document and corresponding tree representation

3.1 Tree Representation of XML Documents

In what follows we use the tree representation of XML documents presented in [2], in which an XML document is represented as a labelled tree. Each node of the tree represents an element or a value. The label associated with a node represents the corresponding tag name or value. The labels used to tag the tree belong to a set of element tags (EN) and to a set of values that the data content of an element can assume (V). In each tree that represents an XML document the root label belongs to EN, being the document element name. Figure 1(b) shows the tree representation of the XML document of Figure 1(a) (footnote 1).

Footnote 1: Explicit direction of edges is omitted. All edges are oriented downward.

3.2 Queries as Labelled Trees

A query is also represented as a labelled tree. In the tree representation, in order to represent optional elements and alternatives of elements, a set OP of operators is introduced, consisting of a sequence operator, the OR operator and the ? operator. The sequence operator represents a sequence of elements, the OR operator represents an alternative of elements, and the ? operator represents an optional element.

Figure 2: Query expressed as a tree (node labels include film, actor, OR, content, ANY and the value "N. Moretti")

In our representation of queries each node corresponds to an element tag, to an element type or value, to an operator, or to a predicate.
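The labelled-tree representation of Section 3.1 can be sketched as follows. This is a minimal illustration, assuming Python's standard xml.etree.ElementTree parser and a nested (label, children) tuple as the tree encoding; the function name and the shortened sample document are hypothetical, and attributes and mixed content are simply ignored, in line with the simplifications stated above.

```python
import xml.etree.ElementTree as ET


def to_labelled_tree(element):
    """Return (label, children): inner nodes carry tag names, leaves carry values."""
    children = [to_labelled_tree(child) for child in element]
    text = (element.text or "").strip()
    if text and not children:
        # The data content of an element becomes a leaf labelled by the value.
        children.append((text, []))
    return (element.tag, children)


# Shortened, hypothetical version of the document of Figure 1(a).
SAMPLE = """<movie>
  <title>La vita e bella</title>
  <cast><actor>R. Benigni</actor><actor>N. Braschi</actor></cast>
</movie>"""

tree = to_labelled_tree(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    print(tree)
```

The root label (movie) belongs to the set of element tags, while the labels of the leaves (such as "La vita e bella") belong to the set of values.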
In each tree which represents a query there is a single edge outgoing from the root, and the root label belongs to EN (it is the name of the main element of the documents we wish to retrieve by the query). Moreover, there can be more than one edge outgoing from a node only if the node is labelled by the sequence operator or by OR. All nodes labelled by types/values are leaves of the tree. Finally, if a node is labelled by a predicate, then it has only one child (i.e., a leaf of the tree labelled by a value). Figure 2 shows the tree representation of a query. We remark that the operators in OP allow us to represent the structure of all kinds of DTDs.

4 Bayesian Approach for WSD: from NLP to XML

Word Sense Disambiguation (WSD) is the problem of assigning the appropriate meaning (or sense) to a given word in a text (or discourse). Resolving the ambiguity of words is a central problem for large-scale language understanding applications and their associated tasks [9]. Besides, WSD is one of the most important open problems in Natural Language Processing (NLP). Selecting the proper sense is a non-trivial undertaking given the phenomenon of polysemy. A word is disambiguated along with the surrounding words of the text in which it is embedded, that is, along with its context. To determine the context of each word to sense-tag, a window of a certain size is moved along the text. Words and context, which are already tagged with a label representing the corresponding syntactic category (noun, verb, adjective or adverb), are usually processed together with a lexical relations database like WordNet [13]. The WordNet ontology is organised around the notion of synset, that is, a set of synonyms. In fact, WordNet represents concepts as lists of the lexical entries that can be used to express the concept.

Figure 3: The query expressed as a tree (node labels: film, Fellini, date, > 1974)

Different machine learning algorithms, both supervised and unsupervised, perform the WSD task of NLP. One of these statistical learning methods is the naive Bayesian approach which, assuming the independence of features (i.e., words of the context), classifies a new example (i.e., sense-tags a new word) by assigning the class (i.e., the synset) that maximises the conditional probability of the class given the observed sequence of features (i.e., words of its context) of that example. The naive Bayes formula is defined as follows [11], where $c_j$ ($1 \leq j \leq n$) are the n words of the context, s is a sense of the word to disambiguate and S is the set of its senses:

\[ \max_{s \in S} P(s) \prod_{j=1}^{n} P(c_j \mid s) \tag{2} \]

For instance, in the natural language sentence "What films did the director Fellini make after the date 1974?", if we want to disambiguate the word director, the words of its context could be date and film (the stemmed version of films). Model probabilities are estimated during the training process using relative frequencies. A sense-tagged corpus like SemCor, which consists of a portion of the Brown corpus tagged with WordNet senses, has to be used during the training and testing phases. At the end, director should be tagged with the synset which refers to the stage. To avoid the effect of zero counts, the simple flat discount smoothing technique can be used [15].
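The selection rule of formula (2) can be sketched as follows. The sense labels and all probabilities below are invented for illustration only; in practice they would be relative frequencies estimated from a sense-tagged corpus. No smoothing is applied in this sketch, so an unseen context word drives the score of a sense to zero.

```python
from math import prod


def naive_bayes_sense(context, prior, likelihood):
    """Return the sense s that maximises P(s) * prod_j P(c_j | s)."""
    def score(sense):
        # Unseen context words get probability 0 (no smoothing in this sketch).
        return prior[sense] * prod(
            likelihood[sense].get(word, 0.0) for word in context
        )
    return max(prior, key=score)


# Hypothetical model for an ambiguous word with two senses.
PRIOR = {"stage_director": 0.4, "company_director": 0.6}
LIKELIHOOD = {
    "stage_director": {"film": 0.30, "date": 0.10},
    "company_director": {"film": 0.01, "date": 0.10},
}

best = naive_bayes_sense(["film", "date"], PRIOR, LIKELIHOOD)

if __name__ == "__main__":
    print(best)
```

Although the second sense has the higher prior in this toy model, the context word film makes the first sense far more likely, which is the effect the context window is meant to capture.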
A query like the one above, if expressed through a query language defined for XML (like XPath [6]), would be translated as:

/film/director="fellini"
/film/date>"1974"

Figure 3 shows the tree representation of the previous XPath query. During semantic indexing, each tag of the DTD query has to be tagged with the right WordNet concept (i.e. synset), in order to allow for the conceptual query of searching all documents which deal with the concept film and satisfy certain requirements. Otherwise, if a search for particular documents were performed with the keyword film alone, the documents carrying an equivalent tag would not be retrieved.

In order to apply the naive Bayes NLP technique in the more structured XML world, we have to redefine what the context of a word (i.e. a tag) to disambiguate could be. In our setting, we define the surrounding words of a tag as its father, its sons and its brothers. For instance, the context of the tag director would be given by film and date. Figure 4 presents the smoothed disambiguation algorithm which was used.

\[ s = \arg\max_{s_i \in S,\; P(s_i) \neq 0} P(s_i) \prod_{j=1}^{n} W_{ij} \]

where:

\[ W_{ij} = \begin{cases} (1 - \lambda)\, P(c_j \mid s_i) & \text{if } P(c_j \mid s_i) \neq 0 \\ \dfrac{\lambda \sum_{k=1}^{n} P(c_k \mid s_i)}{WN - |W_i|} & \text{if } P(c_j \mid s_i) = 0 \end{cases} \]

\[ P(s_i) = \frac{nsig_{s_i}}{nlemma}, \qquad P(c_j \mid s_i) = \frac{nnamesig_{w_{ij}}}{nsig_{s_i}}, \qquad 0 < \lambda < 1 \]

\[ W_i = \{ c_j \mid 1 \leq j \leq n \wedge P(c_j \mid s_i) \neq 0 \} \]

$nsig_{s_i}$: occurrences of the lemma with sense $s_i$
$nlemma$: occurrences of the lemma to disambiguate
$nnamesig_{w_{ij}}$: occurrences of the context lemma $c_j$ with sense $s_i$
$WN$ = 94474, the number of WordNet nouns
$\lambda$: the discount parameter

Figure 4: Smoothed disambiguation algorithm
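The notion of tag context defined above (father, sons and brothers) can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the query is encoded as a small XML fragment rather than as a DTD, and only the first element carrying the target tag is considered.

```python
import xml.etree.ElementTree as ET


def tag_context(root, target):
    """Context of the first element tagged `target`: father, brothers, sons."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = next((e for e in root.iter() if e.tag == target), None)
    if node is None:
        return []
    context = []
    parent = parent_of.get(node)
    if parent is not None:
        context.append(parent.tag)                               # father
        context.extend(s.tag for s in parent if s is not node)   # brothers
    context.extend(child.tag for child in node)                  # sons
    return context


# Hypothetical encoding of the sample query as an XML fragment.
QUERY = ET.fromstring(
    "<film><director>Fellini</director><date>1974</date></film>"
)

if __name__ == "__main__":
    print(tag_context(QUERY, "director"))
```

The resulting list of tags plays the role that the window of surrounding words plays for unstructured text, and could be passed as the context to a naive Bayes classifier.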

5 A Semantic Matching Algorithm for the Retrieval of XML Documents

Our semantic matching algorithm for the retrieval of XML documents is based on the preliminary version which was developed for their classification. In fact, the query evaluation process can be seen as an information retrieval method that uses the hierarchical structure of the document for identifying possible answers to the query. A user wishing to retrieve documents of a certain type from a source may not know the structures used in the source for representing that type of document. Moreover, the possibility of evolving the structure of the DTDs may make queries already defined by the users unusable. For these reasons we aim to define a query resolution mechanism based on similarity.

In general, a query, expressed through a query language defined for XML (like XPath), is translated into a document, called document template. The document template represents the structural and content constraints a document should satisfy in order to be an answer to the query. Thus, roughly speaking, a document template can be seen as a DTD with content constraints. Intuitively, a document that weakly conforms to the document template is an answer to the query.

Thus, our idea is to define an approach to querying XML documents based on the classification algorithm. The documents in a source are classified against the document template and the documents satisfying its constraints are collected into a set (called query answer set). Note that, because our approach to retrieving documents is based on the classification algorithm, the query answer set can contain documents that are similar to the document template, with respect to both the structure and the vocabulary. Since we are weakening the notion of conformance as far as document structure is concerned, the requirement of tag equality can be weakened as well, and tag similarity can be considered.
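One way to picture the weakening of tag equality into tag similarity is the sketch below. Everything in it is an assumption made for illustration: the toy synset table stands in for WordNet, the crude plural stripping stands in for a real stemmer, and the synonym affinity of 0.8 is the typical value referred to in this paper. The penalty function shows how the missing tag equality could contribute to the m component for two tags that have been matched.

```python
# Toy stand-in for WordNet: each synset is a set of synonymous lemmas.
TOY_SYNSETS = [
    {"film", "movie", "picture"},
    {"date", "day"},
]


def stem(tag: str) -> str:
    """Crude normalisation (assumption): lower-case and strip a plural 's'."""
    tag = tag.lower()
    return tag[:-1] if tag.endswith("s") and len(tag) > 3 else tag


def tag_affinity(doc_tag: str, query_tag: str,
                 synonym_affinity: float = 0.8) -> float:
    """1.0 for equivalent tags, synonym_affinity for synonyms, 0.0 otherwise."""
    a, b = stem(doc_tag), stem(query_tag)
    if a == b:
        return 1.0
    if any(a in synset and b in synset for synset in TOY_SYNSETS):
        return synonym_affinity
    return 0.0


def minus_penalty(doc_tag: str, query_tag: str) -> float:
    """Amount added to the m component for two matched tags."""
    return 1.0 - tag_affinity(doc_tag, query_tag)


if __name__ == "__main__":
    print(tag_affinity("Films", "film"))
    print(tag_affinity("film", "movie"))
    print(minus_penalty("film", "movie"))
```

Equivalent tags incur no penalty, whereas synonymous tags are matched at the cost of a small increase of the minus component, which lowers the overall similarity value.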
The semantic version of the Match algorithm handles this possibility by relying on an ontology like WordNet for evaluating tag similarity. First of all, tags in the document and in the query need not be exactly the same: they are considered equivalent if they are morphological variants or share the same stem (like a singular term and its plural, or two differently spelled versions of a term). Moreover, two tags are considered similar if, according to the WordNet ontology, they belong to the same synset (e.g. film and movie). A certain affinity value, ranging between 0 and 1, can be assigned in order to take synonymy into account. A typical value we refer to is 0.8 (as considered in [4]). In the similarity measure, when two such tags are matched, the value 1 minus the affinity is added to the component m of the subtree evaluation. In this way we capture the missing tag equality.

Tag similarity has been handled by extending the approach taken for DTDs with different elements with the same tag at the same level. Also in this case, different matches are possible for an element with a given tag in the document. All possible matches must be considered and evaluated, and the best one (i.e., the one leading to the highest similarity) must be selected. In this case, however, the DTD could contain elements, subelements of the same element, whose tags are synonyms but whose structures are different. Thus, each subtree under an element tagged with a synonym of that tag in the document must be matched against all of them, resulting in the same subtree in the document being visited several times.

6 Experimental Results

Unfortunately, large XML repositories are still lacking: the generation of sources of XML data for evaluation purposes, as well as the development of evaluation methods, is indeed mentioned as the first research issue to be addressed [5].
To validate the proposed technique used by our search engine we have performed some experiments on "real data" on movies, gathered from the Web, and on "synthetic data" generated in order to introduce some polysemy between XML tags (e.g. tags with film in the sense of photographic film). In the case of real data we extracted over 1000 XML documents from HTML documents describing movies. These data come from two sources of related documents on the Web (the Filmup and 35mm sites). For queries like the one of Figure 5, the documents of the collection were matched against the query [10]. The structural and semantic similarity value was calculated for each match between the DTD query and each XML document, and a ranking of all the matches was obtained. Figure 5 shows a case of perfect matching (similarity value equal to 1) and two cases in which the similarity between the document template and the XML document is quite high (values equal to 0.92 and 0.83, respectively).

Figure 5: Example of three different matches (similarity values 1, 0.92 and 0.83)

Precision and recall measures (footnote 2) [1] were calculated for every DTD query. For instance, for the query of Figure 5, with a similarity threshold equal to 1 (i.e., we are interested in XML documents which exactly match the query) we obtained precision = 1 and recall = 0.27. When the similarity threshold is equal to 0.88 (i.e., we are interested in documents which are very similar to our query) we obtained an important improvement of the recall measure while retaining a good precision: precision = 0.8 and recall = ….

Footnote 2:
\[ \text{precision} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{retrieved docs}\}|}, \qquad \text{recall} = \frac{|\{\text{relevant docs}\} \cap \{\text{retrieved docs}\}|}{|\{\text{relevant docs}\}|} \]

7 Conclusions and Further Work

In this paper an approach for querying XML sources was developed, relying on structural and semantic similarity. Just as a DTD defines the structural constraints a document must satisfy to be one of its instances, a query defines the structural and content constraints a document must satisfy to be an answer to the query. Based on this idea, our approach considers a query, expressed through an XML query language, and translates it into a document, named document template. The document template represents the structural and content constraints a document should satisfy in order to be an answer to the query. Our idea was to define an approach to querying XML documents based on the classification algorithm. The documents in a source are classified against the document template and the documents satisfying its constraints are collected. In order to be able to retrieve a greater number of documents matching a certain query (i.e.
to increase the recall of the retrieved documents), the notion of structure and tag similarity was taken into account. By considering the similarity between tags, we allow for conceptual queries. We have proposed a measure to evaluate the structural and semantic similarity of a document with respect to a DTD. This measure can be used for the retrieval of those documents which match the query. The semantic indexing was done by tagging with WordNet synsets. The corpus-based naive Bayes approach, generally used for sense-tagging unstructured text, was employed to disambiguate the senses of the tags of the structured labelled trees.

The first experimental results we obtained are promising, even if more experiments on real and synthetic data need to be performed, especially with both queries and documents indexed. Unfortunately, the lack of large XML repositories is still a problem, as is the lack of sense-tagged corpora for XML documents. In order to overcome the latter problem, for the WSD task we are considering the use of a fully automatic knowledge-based method which relies only on the lexical relations of the WordNet ontology and does not need any training data.

Acknowledgements

We wish to thank Stefania Lombardo for gathering movie HTML pages from the Filmup and 35mm Web sites and mapping them onto XML documents. The work of Paolo Rosso was partially supported by the research grant of the Vicerrectorado de Investigación, Desarrollo e Innovación (UPV) and by the Spanish research project (CICYT) TIC C02.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley.
[2] E. Bertino, G. Guerrini, I. Merlo and M. Mesiti. An Approach to Classify Semistructured Objects. In R. Guerraoui (Ed.), Proc. of the Thirteenth European Conference on Object-Oriented Programming, number 1628 in Lecture Notes in Computer Science.
[3] D. Carmel, Y. Maarek and A. Soffer. XML and Information Retrieval: a SIGIR 2000 Workshop. SIGMOD Record, 30(1):62-65.
[4] S. Castano, V. De Antonellis, M. G. Fugini and B. Pernici. Conceptual Schema Analysis: Techniques and Applications. ACM Transactions on Database Systems, 23(3), September.
[5] S. Ceri, P. Fraternali and S. Paraboschi. XML: Current Developments and Future Challenges for the Database Community. In C. Zaniolo, P. Lockemann, M. Scholl and T. Grust (Eds.), Proc. of the Seventh Int'l Conf. on Extending Database Technology, Lecture Notes in Computer Science. Springer.
[6] T. Chinenyanga and N. Kushmerick. An Expressive and Efficient Language for XML Information Retrieval. Journal of the American Society for Information Science and Technology, Special Issue on XML and Information Retrieval.
[7] M. N. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim. XTRACT: A System for Extracting Document Type Descriptors from XML Documents. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data.
[8] E. Bertino, G. Guerrini and M. Mesiti. Measuring the Structural Similarity among XML Documents and DTDs. Internal Report, Dept. of Informatica e Scienze dell'Informazione, Università di Genova, Italy, 2002 (also submitted to a journal for publication).
[9] N. Ide and J. Veronis. Introduction to the Special Issue on Word Sense Disambiguation. Computational Linguistics, 24.
[10] M. Merlo. Un Approccio all'Interrogazione di Documenti XML basato su una Misura di Similarità tra Strutture. Thesis, Dept. of Informatica e Scienze dell'Informazione, Università di Genova, Italy, April.
[11] D. Jurafsky and J. Martin. Speech and Language Processing. Prentice Hall.
[12] S. Landes, C. Leacock and R. I. Tengi. Building Semantic Concordances. In C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, USA.
[13] G. A. Miller. WordNet: A Lexical Database for English. Communications of the ACM, 38(11):39-41, November.
[14] W3C. Extensible Markup Language (XML) 1.0.
[15] S. Young and G. Bloothooft. Corpus-based Methods in Language and Speech Processing. ELSNET book edition, 1997.
