Information Retrieval from Structured Documents Represented by Attribute Grammars

Size: px

Start display at page:

Download "Information Retrieval from Structured Documents Represented by Attribute Grammars"

Clarence Morris
6 years ago
Views:

1 Abstract Information Retrieval from Structured Documents Represented by Attribute Grammars Alda Lopes Gançarski * alda.lopes@lip6.fr Pedro Rangel Henriques ** prh@di.uminho.pt This paper presents a system for Information Retrieval (IR) from collections of structured documents represented by Attribute Grammars (AG). Each document corresponds to a syntactic tree with nodes decorated with sets of attributes. The values of these attributes correspond to characteristics which specify the semantics of the textual content and the structure in order to perform IR. First, we show how to generate such grammars. Then, we explain how the AG analyser constructs document representations to obtain a ranked list of the result to the queries. The relevance of the document components is estimated using textual and structural information. This information is given by a set of heuristics or, alternatively, is explicitly introduced into the system by a specific document. 1. Introduction It has been recently a tremendous growth of the specification of structured textual information using the standards Standard Generalised Markup Language (SGML), Hypertext Markup Language (HTML) and extensible Markup Language (XML). When a collection of documents share the kind of information they deal with, it is natural to think about describing that information in the same way for all of them. For example, business letters have almost all the same structure, so it is desirable that this structure would be described in a standard way. These ideas were at the origins of a standard for specifying markup languages, the SGML, as a format of exchanging documents. The specification of the structural elements and their hierarchical relations for a given type of documents is made through the Document Type Definition (DTD). Later, the HTML was developed as an SGML application to show the documents in the Web. This standard makes use of mainly presentation marks to describe how the textual parts must be displayed by the browser. More recently, XML appeared as a simplification of SGML adapted to exchange documents in the Web. XML documents have a description richer than the simple but limited one provided by HTML. We can take advantage of the markup information to represent structured documents in a standard way in order to extract results from them, whatever the system application, by means of a Attribute Grammar (AG) ([Gan01]). Not only it is a wide known and rather simple concept, but it also allows to uniformly represent heterogeneous sources of structured information, avoiding to develop new intermediate specifications or languages for specialised tasks. A Attribute Grammar (AG) consists of a context independent grammar extended by a set of attributes (and rules for their calculation) which specify the semantics of the analysed texts. If necessary, it also allows to impose contextual conditions to productions, based on attribute values. The result of the syntactic and semantic analysis of a text is an abstract syntax * Pierre et Marie Curie University, Laboratoire d Informatique de Paris 6, Paris, France. ** University of Minho, Department of Informatics, Braga, Portugal.

2 tree decorated with the attribute values (DAST). When an attribute value is calculated by the production where the respective symbol is derived (the symbol is on the left-hand side of the production) it is called synthesised attribute; otherwise, it is called inherited. Synthesised attributes make the propagation of the semantic information from the leaves up to the root of the syntax tree, while the inherited ones do the same but from the root down to the leaves or between sibling nodes. A AG in which all the attributes can be evaluated by visiting the nodes of the syntactic tree in a single left-to-right, depth-first traversal is said to be L-attributed [Sco00]. In this case, the semantic analysis can be interleaved with the syntactic analysis. The task we address here is Information Retrieval (IR). IR consists of retrieving from a collection the relevant documents to a query, while returning as few as possible of non-relevant documents. Moreover, the resulting documents should be ranked by their relevance to the query. In classical IR, a document is represented by the set of terms which are considered to describe the meaning of the text. The selection of those terms is called automatic indexing and consists of some operations done on the text, like stemming and elimination of non-informative words (stopwords). One possible textual representation is a vector in which the components store a measure of how representative is each term with respect to the meaning of the text. This measure is based both on the frequency of the term in a document and the number of documents that contain the term. The relevance of a document to a query is usually computed by the cosine between the vector that represents the text and the one that represents the query. In order to speed up the search when performing query processing, appropriate data structures are build over the text, called indexes [BYRN99]. Among several techniques, we chosed inverted files which are currently the best choice for most applications. An inverted file is a word-oriented mechanism composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of different terms in the text. For each term, a list of all the text positions where the term appears is stored. The set of all lists is called the occurrences file. With the emerge of structured documents, the need appeared to construct an index for the structure where each element occurrence is associated with its father. To take advantage of the structural information of the documents, IR research evolved, first, retrieving documents taking into account the structural parts relevance ([Wil94]); then, enriching the query formats with structural information to retrieve certain parts of documents ([NBY95]); more recently, some works tried to establish the relevance ranking ([Lal97], [WFC99], [HTK00], [SN00]), but it is still an opened research area. We intend to retrieve parts of documents and to establish the relevance ranking of the results. For that, we use the structure information together with the textual content to make the relevance calculations. The information about the structure is expressed by a set of heuristics based on the analysis of different types of structured documents. Our system is composed of different modules, as shown in Figure 1. Using the DTD, the AG Generator module will automatically generate a specific AG, as we show in the section 2. The AG Analyser (section 3) makes a syntactic and semantic analysis of an input document to create the corresponding DAST. The textual contents and structure information from the document are used as arguments of various external functions which compute the attribute values (for example, a function can compute the frequency of a term in a text). As an alternative to the heuristics about the structure information (section 4), a document called Information about Structure for IR (ISIR) can be used by external functions to calculate the relevance of the document components (section 5). A query is a pair (ElemType, textual restriction) which expresses the type of element to return and a textual restriction over it (expressed as a natural language query). The result of analysing one document is the ranked list of the desired elements. After analysing all the documents, the resulting lists can be merged to have a final ranked list of all the relevant element occurrences. This result can be given to the

3 user in several ways; for example, each occurrence may have a pointer to the document it belongs to. DTD AG Generator Text and Structure External Functions Structured Document AG Analyser Attribute Values Structured Query DAST ISIR Fig. 1 - The IR system. 2. The AG Generator To represent a structured document by a DAST, we need to specify both the syntactic and the semantic components of the AG. The syntactic part of the AG corresponds to the structure of the document. Production rules can be automatically constructed from the element declarations using a pre-defined set of generic mapping rules. Each rule corresponds to a type of element declaration or a type of operator in the regular expressions of the element contents: For an element (ElemType) declared as a sequence of N components (Comp_1,, Comp_N) in the following DTD declaration: <!ELEMENT ElemType - - (Comp_1,, Comp_N)>, the corresponding production of the AG is: ElemType Comp_1 Comp_N. For an element declared as an alternative of components: <!ELEMENT ElemType - - (Comp_1 Comp_N)>, a new production of the AG is generated for each component: ElemType Comp_i, i=1..n. If a component (Comp) is formed by another regular expression, the productions of the AG vary according to the operator of the expression. For sequential expressions and alternative expressions, the productions are, respectively: Comp Comp_1 Comp_N and Comp Comp_i, i=1..n, where Comp is a new non-terminal symbol. For one or more occurrences of a component (Comp +), zero or more occurrences (Comp *) and zero or one occurrence (Comp?), a new symbol is created (NewSymbol) and the set of productions are, respectively: (Comp +) (Comp *) (Comp?) NewSymbol Comp NewSymbol NewSymbol Comp NewSymbol Comp NewSymbol NewSymbol ε 1 NewSymbol Comp NewSymbol ε The new symbols of the grammar (Comp and NewSymbol) do not have meaning, they are just syntactic constructs, so we will ignore them in the example below. The generated AG productions are, then, augmented with the semantic attribute calculations. The basic set of attributes for IR is: - identifier - the unique identifier for the element occurrence; 1 Empty production.

4 - elemtype - the type of an element; - textrep - the vector representation of a textual component; - relevance - the relevance of an element to the query; - select - a binary value to detect the elements of the type specified in the query; - oldranklist - the ranked list of relevant elements inherited from the partial DAST constructed until the present element; it is initialised with an empty list in the production of the root symbol; - newranklist - the ranked list of relevant elements updated with the information coming from the partial DAST with the present symbol as root. We show now a simple example composed by a DTD which describes mail letters, a corresponding instance, a query and the final DAST representation of the instance (Figure 2). DTD: <!DOCTYPE Letter SYSTEM [ <!ELEMENT Letter - - (Subject, Head, Message)> <!ELEMENT Subject - - (#PCDATA)> <!ELEMENT Head - - (#PCDATA)> <!ELEMENT Message - - (Paragraph+)> <!ELEMENT Paragraph - - (#PCDATA)>]> Instance:<Letter> <Subject>Vacations</Subject> <Head>14 Nov 2000, Dear Marianne: </Head> <Message> </Message> </Letter> <Paragraph> I am in vacations in Europe!<Paragraph> <Paragraph> I visited Portugal and Spain. Now I am in Paris, a city with lots of monuments and interesting things to see! Then I plan to visit you. Will you be there?</paragraph> <Paragraph>I hope to enjoy the rest of my vacations! I will visit nice places as much as possible! I whish you good vacations too!</paragraph> The receiver of this message wants paragraphs talking about vacations (query=(paragraph, vacations )). In the DAST, the synthesised attributes appear on the right hand side of the derivation arrows and the inherited ones, i.e. the ones that depend on attributes coming from the ascendants, on the left hand side. For better understanding, the relevance attribute has a very simple calculation function: it is the frequency of the textual restrictions in the textual content, here the frequency of the term vacations. Each symbol corresponding to an element is assigned a unique natural number identifier. The select attribute has always the value 0, except for the occurrences of the desired element (Paragraph). The old ranked list of occurrences of the desired element (oldranklist) for the symbol Letter is an empty list since there is no partial DAST constructed until that node. Then, it is inherited by Subject (for clarity, the evolution of the list is indicated with dotted arrows). The corresponding updated list (newranklist) remains empty because Subject is not the desired element. This list is inherited by Head and remains empty because Head is still not the desired element. The same empty list is passed to Message and then to the first Paragraph in the oldranklist attribute. The attribute select of this paragraph has the value 1. Thus, a relevance value of 1 is stored in relevance, which means that there is one occurrence of the term vacations. The new ranked list becomes a single list with this paragraph s identifier (5). In the following paragraph, this list is not altered since there are no occurrences of the word vacations. The last Paragraph has a relevance value of 2; consequently, the head of the ranked list becomes the identifier of this

5 paragraph (7, 5). This list is now passed up to Message and then to Letter. The final result of the IR process over the instance is the newranklist attribute of the root symbol, i.e., (7, 5), which means that the Paragraph identified by 7 ( I hope vacations too! ) is the most relevant to the query, followed by the Paragraph identified by 5 ( I am in vacations in Europe! ). Letter identifier = 1 elemtype=letter oldranklist = ( ) relevance =4 select = 0 ranklist = (7, 5) Subject Head Message identifier=4 elemtype=message oldranklist=() identifier=2 elemtype=subject identifier=3 elemtype=head oldranklist=() textrep= oldranklist=() textrep=... relevance=1 relevance=0 select=0 select=0 ranklist=( ) ranklist = ( ) Vacations 14 Nov 2000, Dear Marianne relevance=3 select=0 ranklist =(7, 5) Paragraph Paragraph Paragraph elemtype=paragraph elemtype=paragraph elemtype=paragraph identifier = 5 textrep= identifier=6 textrep= identifier=7 textrep= oldranklist=() select=1 oldranklist=(5) select=1 oldranklist=(5) select = 1 relevance=1 relevance=0 relevance =2 ranklist=(5) ranklist=(5) ranklist =(7, 5) I am in vacations in Europe! I visited be there? I wish you good vacations! Bye! Fig. 2 : The DAST representation of a structured document. 3. The AG Analyser To represent a document by a DAST, it is analysed syntactically and semantically. Since our AG is L-attributed, both the information concerning the position of an element in the syntactic tree and its attributes are computed and stored in the same step during the syntactic analysis. Before starting the user queries session, a partial DAST is created with the queryindependent attributes. This consists of saving the syntactic tree in a structure index, making the automatic indexation and frequencies calculations to produce the vector representation of the text and calculating other fixed attributes, like the element type. Figure 3 shows the text index (vocabulary and occurrences) and the structure index. In the vocabulary, each term (term) is associated with the number of elements where it appears (efreq, needed for the vector representation of the text) and a pointer to the correspondent list of occurrences in the occurrences file (occptr). For each occurrence, the system stores the element identifier where the term occurs (elemid) together with its frequency in the same element (tfreq, also needed for the textual representation). The structure index has the following information about each

6 element (elemid): its parent (parent), its type (type) and the maximum frequency of a term in its textual component (tfmax, still needed for the textual representation). When the user starts the queries, the system uses the previously created indexes to calculate the attributes related with the relevance of the document components. This attributes are calculated for each element following the order in which the elements appear in the structure index, i.e. the same as in the syntactic analysis (left-to-right, depth-first). In the example of section 3, the order is <2, 3, 5, 6, 7, 4, 1>. Following this order, a recursive algorithm maintains, for each element, a stack with the relevance of each of its components in order to calculate its own relevance. The same algorithm keeps the ranked list of the relevant elements in a vector and augments it with the elemid of each new relevant selected element. Dictionary Occurrences Structure Index term efreq occptr elemid tfreq elemid parent type tfmax Fig. 3 : The text index (vocabulary and occurrences) and the structure index of the system. 4. Information about Structure for IR In IR, only textual information is used to calculate the relevance of a document to a query. In our work, we want to improve the relevance evaluation using also structural information in two ways: allowing the user to specify parts of documents in the query; and developing the following heuristics based on the analysis of different types of structured documents: 1. Basically, a markup can be for presentation (procedural markup, such as Bold or Italic tags), for logical organization of the content (descriptive structural markup, such as Chapter and Section tags) or it can correspond to concepts defining the content (descriptive nominal markup, such as Car or Hotel tags) [Pit]. In the latter case, the name of the tags should be indexed because they are information that can be asked directly by the user (when, for instance, he wants documents about hotels for vacations). 2. The textual elements (elements with just a #PCDATA content) of a sequence are usually more related to each other than other sibling elements up in the structure. It is thus likely that, if one of them is relevant to the query, the majority of them are too. This is the case of a sequence of Line elements in a paragraph. 3. A textual component before a list of elements can be interpreted as a context for those elements. This is the case of an introduction of a chapter with respect to its sections. 4. The elements can be categorized in different levels of importance with respect to the relevance of a document (or an ascendant). Typically, the element Title of a scientific article has a much greater importance than a Footnote element. The system should be able to detect the existence of titles, headers or other tags which are easily assigned with a relative importance value. An alternative or a complement to the above heuristics, is the explicit introduction of that information in a particular document called Information about Structure for IR (ISIR). The specification of a ISIR is given to the system together with the DTD of the documents collection (recall Figure 1). We propose the following XML syntax for the ISIR: <!DOCTYPE ISIR [ <!ELEMENT ISIR (NominalMarkup?, Sequence?, Context*, Importance?)>

7 <!ELEMENT NominalMarkup (ElemType+)> <!ELEMENT ElemType (#PCDATA)> <!ELEMENT Sequence (TextType+)> <!ELEMENT TextType (#PCDATA)> <!ELEMENT Context (TextType, TextTypeList)> <!ELEMENT TextTypeList (TextType+)> <!ELEMENT Importance (ElemType, Weight)+> <!ELEMENT Weight (NUM)>]> Here, NominalMarkup specifies the tags that must be indexed; Sequence refers to a sequence of strongly related textual components (given by a sequence of TextType tags); Context defines a contextual relation between a textual component and a list of textual components (TextTypeList); Importance defines the attribution of a weight Weight to the element type ElemType. The element types are the elements declared in the DTD of the documents collection. The textual component types correspond to the different possible paths from the root to a textual content specified in the DTD: each time a #PCDATA content occurs, it is assigned a TextType unique type. A ISIR document looks like: <ISIR> <NominalMarkup> <ElemType> </ElemType> </NominalMarkup> <Sequence> <TextType> </TextType> </Sequence> <Context> <TextType> </TextType> <TextTypeList> <TextType> </TextType> </TextTypeList> </Context> <Importance><ElemType> </ElemType><Weight> </Weight>... </Importance> </ISIR> 5. The Relevance Calculation Relevance calculation is performed by dedicated functions of the External Functions module. To process a query, it is first necessary to identify the required elements and, eventually, the elements where the textual restriction will be evaluated. Let TextOcc be the occurrence of a textual component of type TextType in a document. The relevance of such a component, with respect to a textual restriction, is calculated by a function RelText of the type of the component, its textual representation (TextRep) and the structural information given by the heuristics or the ISIR: Relevance(TextOcc) = RelText(TextType, TextRep, Heuristics/ISIR). To show the role of the information from the heuristics (or the ISIR) in the calculation of this relevance, recall the example of section 3. If the content of Subject was taken into account as a context to the paragraphs (heuristic 3), and these ones were considered to be related to each other (heuristic 2), then the second paragraph would have a relevance value different from 0 (despites it does not have the term vacations), entering in the final ranked list. This is correct in the sense that this paragraph, like the others, also talks about the vacations of the sender and probably would interest the receiver. After analysing the relevance of the textual components of a document, the required elements ranked by their relevance are given to the user. The system calculates this relevance for all the elements in the path from the textual components up to the required elements. The calculation is done in a simplistic way by the average of the relevances of the N components (Comp n ) of an

8 element type that occurs in a document (ElemOcc). For textual components, this relevance is given by the function RelText; for structural components, it is weighted by the respective importance (Importance). We assume the importance of a textual component equal to the highest importance of the sibling elements. Thus, we have: Relevance(ElemOcc) = [Σ N Relevance(Comp n )*Importance n ] / Σ N Importance n 6. Conclusions and Future Work We have presented a system to make IR over collections of structured documents represented in DASTs. We extend the traditional IR approach, based on textual representations only, with heuristics about the structure in order to improve the quality of the relevance estimation. We proposed also the possibility of substituting the heuristics by a ISIR document with precise information concerning the structure of the documents. As future work, we intend to: (1) Analyse further the information about the structure that we can get from different types of structured documents to extend the ISIR and improve the retrieval quality. (2) Finish the implementation of the system with the approximation of the function RelText using statistical methods. For that we will seek for appropriate training sets, i.e. labelled structured documents collections and respective queries. Then, we will test the retrieval quality of the system calculating appropriate standard IR measures (namely Precision and Recall) and compare the results with the ones given using a simple textual representation and the cosine similarity. (3) Extend the AG specification with XML attributes, entities, hyperlinks and other multimedia information in the automatic generation of the AG. We believe that their semantic can improve the IR task. (4) Extend the system to accept more flexible queries. Financial Support: This work is granted by the Sub-Programa Ciência e Tecnologia do 2 Quadro Comunitário de Apoio of the "Fundação para a Ciência e a Tecnologia (FCT). References [Gan01] A. L. Gançarski, Using Attribute Grammars to Uniformly Represent Structured Documents Application to Information Retrieval, Third Network of Excellence Workshop on Interoperability and Mediation in Heterogeneous Digital Libraries, Darmstadt, Germany, [HTK00] Y. Hayashi, J. Tomota and G. Kikui, Searching text-rich XML documents with relevance ranking, ACM SIGIR'00 Workshop on XML and Information Retrieval, Athens, Greece, [Lal97] M. Lalmas, A dempster-shafer theory of evidence applied to structured documents: Modelling uncertainty, Research and Development in Information Retrieval, p , [NBY95] G. Navarro and R. Baeza-Yates, A language for queries on structure and contents of textual databases, ACM SIGIR 95, [Pit] D. V. Pitt, Standard Generalised Markup Language and the Transformation of Cataloging, [Sco00] M. L. Scott, Programming Languages Pragmatics, Morgan Kaufmann Publishers, San Francisco, California, [SN00] T. Schlieder and F. Naumann, Approximate tree embeeding for querying XML data, ACM SIGIR 2000 Workshop on XML and Information Retrieval, Athens, Greece, [WFC99] J. E. Wolff, H. Florke and A. B. Cremers, Xpres: a ranking approach to retrieval on structured documents, Technical report, University of Bonn, Germany, [Wil94] R. Wilkinson, Effective retrieval of structured documents, ACM SIGIR 94, 1994.

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval

Using Attribute Grammars to Uniformly Represent Structured Documents - Application to Information Retrieval Alda Lopes Gançarski Pierre et Marie Curie University, Laboratoire d Informatique de Paris 6,