A Framework for the Theoretical Evaluation of XML Retrieval

Size: px
Start display at page:

Download "A Framework for the Theoretical Evaluation of XML Retrieval"

Transcription

1 Journal of the American Society for Information Science and Technology A Framework for the Theoretical Evaluation of XML Retrieval Tobias Blanke King s College London Centre for e-research, Drury Lane WC2B 5RL, London, Uk tobias.blanke@kcl.ac.uk Mounia Lalmas Yahoo! Research Barcelona, Avinguda Diagonal Barcelona, Catalunya, Spain) mounia@acm.org Theo Huibers University of Twente Department of Computing Science, Drienerlolaan NB Enschede, The Netherlands theo.huibers@thaesis.nl This article presents a theoretical framework to evaluate XML retrieval. XML retrieval deals with retrieving those document components, the XML elements, that specifically answer a query. In this paper, theoretical evaluation is concerned with the formal representation of qualitative properties of retrieval models. It complements experimental methods by showing the properties of the underlying reasoning assumptions that decide when a document is about a query. We define a theoretical methodology based on the idea of aboutness and apply it to current XML retrieval models. This allows comparing and analyzing the reasoning behaviour of XML retrieval models experimented within the INEX evaluation campaigns. For each model we derive functional and qualitative properties that qualify its formal behaviour. We then use these properties to explain experimental results obtained with some of the XML retrieval models. Background and Motivation According to INEX, the evaluation initiative for XML retrieval (Gövert, Kazai, Fuhr, & Lalmas, 2006), the aim of XML retrieval is to retrieve not only relevant document components, but those at the right level of granularity, i.e. those that specifically answer a query. XML, contrary to HTML, separates the logical structure of documents from the layout. The logical structure of an XML document forms a tree of elements, which starts with a root element and has edges between elements. 1 Queries can also contain structural hints or be purely content-based. To evaluate how effective XML retrieval approaches use this structure to return specific answers, it is necessary to consider whether the right level of the structure is correctly identified. For this purpose, two dimensions of relevance have been used at INEX when evaluating XML retrieval effectiveness. The general relevance of an element (how relevant the information contained in the element is) is captured in the INEX exhaustivity dimension 2, whereas the specificity dimension indicates how focused an element is (the element does not contain non-relevant information). INEX uses these two dimensions to evaluate the effectiveness of the more complex task of retrieving structured documents, where the structure is represented through the XML mark-up. C. J. van Rijsbergen (1989) suggested that, given the increasing complexity of a retrieval task, as it is apparent in XML retrieval, an experimental approach to information retrieval (IR) should be complemented with a theoretical evaluation. In this sense, a theoretical evaluation can be complementary to an experimental evaluation if it helps to clarify the assumptions of retrieval models and if it can identify the characteristics leading to a particular experimental behaviour. Theoretical evaluation In this paper, the basis of our theoretical evaluation is the logical approach to IR (Huibers, 1996). In 1971, Cooper (1971) coined the term logical relevance for an objective view on relevance, in which the topical relation between document and query is considered. Van Rijsbergen and others have expressed logical relevance in terms of the implication d q, where d and q represent, respectively, the document and the query (C. J. van Rijsbergen, 1989). Chiaramella (2001) used two implications to describe structured document retrieval (XML retrieval is a special albeit most dominant application of structured document retrieval (Lalmas, 2009)): d q modelling exhaustivity and q d modelling specificity. Following Huibers formalism and approach, we call topical implications between query and document aboutness, where aboutness is described by formally deriving the reasoning process involved in IR models. 1 In this paper, elements and document components are used interchangeably. 2 Since 2006, INEX does not refer to exhaustivity anymore, just relevance and specificity. 1

2 2 As early as 1977, aboutness is discussed in (Hutchins, 1977), in the context of summarisation of texts and how manual indexers decide a document to be about a topic. Already then, aboutness is linked to topically and therefore an objective view on the content of documents. Bruza, Song, and Wong (2000) contribute an analysis of a common form of aboutness used in information retrieval and discuss how specific properties of aboutness can have an impact on experimental performance. Hjørland (2001), on the other hand, emphasises that a common-sense approach might not be sufficient and should be amended with an investigation into the epistemological view of aboutness, which takes into consideration the different views groups of people see as making a query to be about a document. Our work is less concerned about how users of information retrieval system derive aboutness but more with the question how these systems themselves decide aboutness. In this sense, it is related to the research in (Greisdorf & O Connor, 2003), which asks what are the underlying topical characteristics that link the output of an information retrieval system with a user s information need, or, as Greisdorf and O Connor (2003) put it, how it is deciced that the output of an IR system is on topic. We also use aboutness to describe the topical relation between a query a document, as an IR system sees it, and evaluate subsequently the performance of the system in a theoretical evaluation. We follow the approach in (C. J. van Rijsbergen & Lalmas, 1996) who have used aboutness to theoretically evaluate IR models, as have Bruza and Huibers (1994), and Wong, Song, Bruza, and Cheng (2001). They have all shown that an aboutness-based theoretical evaluation can be a powerful tool to evaluate flat document retrieval models. This paper extends the existing aboutness approaches by concentrating on XML retrieval models and the assumptions made in some successful XML retrieval models to decide aboutness using a combination of structure and content. In (Blanke & Lalmas, 2011), we have already applied an aboutness approach to analyse how XML retrieval models attempt to deliver only highly specific results. In this paper, we concentrate on explaining general performance results, i.e. what makes a document component about a query. The particular emphasis is the influence of the structure on the aboutness decision. More recent studies using a theoretical evaluation approach are the ones by Hui Fang, Cheng Xiang Zhai and Tao Tao (Fang, Tao, & Zhai, 2004) (Fang & Zhai, 2005). They present a formal study and a universal framework for the analysis of IR models, using a set of basic desirable constraints that any reasonable retrieval function should satisfy for good retrieval performance. At first sight, their approach looks similar to ours. However, they do not rely on a logical framework with their constraints but rely on the generalisation of intuitions mainly related to TF-IDF (term frequency and inverse document frequency) measures. These include the formalisation of a sensible interaction between TF-IDF: if given a fixed number of occurrences of query terms, a document that has more occurrences of discriminative terms (higher IDF) should achieve a better ranking. Such questions are interesting to anybody working on a new IR model. Their approach has advantages towards ours. Ours is more abstract and high-level, as we will see. As Fang and Zhai (2005) do not employ a high level of abstraction such as aboutness but remain within the parameters of standard measures to improve retrieval directly, immediate recommendations for further improvements of models can be made. This advantage is, however, also a disadvantage when it comes to the analysis of different and new retrieval tasks such as XML retrieval. Here, we do not yet have commonly agreed foundations. TF-IDF, for instance, is subject to discussions in XML retrieval, as, for example, it does not deal well with the problem of overlap in XML retrieval (Lalmas, 2009). We remain more abstract and therefore use an aboutnessbased approach. Furthermore, we use concepts from logic such as monotonic behaviour. To us, on a very abstract level, a relevance score is a function with various variables that include often terms and their frequency values, but also other parameters. We suggest to study aboutness rules and monotonicity and how they behave with respect to these variables. These will be the basis of our theoretical evaluation framework for XML retrieval. Theoretical evaluation framework Our theoretical evaluation methodology consists of three components, which are presented in the following sections. This section introduces the first component, a formalism with a translation to express aboutness symbolically, and the second component, which specifies aboutness by deriving rules of reasoning behaviour. The third section presents the third component, specific to XML retrieval, the pure type XML retrieval model to capture the influence of the XML structure on aboutness. In the forth section, all these components of the theoretical evaluation are applied to XML retrieval models, which were successful in the INEX evaluation. The fifth section, finally, uses the results of the theoretical evaluation to explain experimental results in INEX. The paper ends with our conclusions and insights on theoretical evaluation both in the context of XML retrieval and other areas in information retrieval. In this section, we systematically introduce the first two theoretical evaluation components, starting with the basic formalism in the next section. Basic Formalism Any theoretical evaluation methodology requires formalisms, complete enough to characterize the fundamental properties of retrieval models and complete enough to study their properties. Following (Bruza & Huibers, 1994), we use Situation Theory, developed by Barwise and Perry (1983), for this purpose. Situation Theory is a mathematical theory of meaning and information with situations as primitives (Huibers, 1996). Situations are partial descriptions of the world and are composed of infons. In the context of IR, queries and documents are modelled as situations,

3 FRAMEWORK FOR THE THEORETICAL EVALUATION OF XML RETRIEVAL 3 while infons represent information items such as keywords or phrases. The aboutness relation between two situations is represented with the symbol, using the same symbol as (Huibers, 1996). If we consider documents and queries to be situations, then D Q means that the information in the document represented by situation D is about the information need expressed in the query represented by situation Q. In standard IR models, a document containing garden and house would be about a query asking for garden. Query and document would share the term garden, and most IR models consider a document to be relevant to a query if they overlap in index terms. However, there might also be IR models that are not based on pure information overlap between query and document and would not consider D to be about Q in that case. A model could, for instance, limit aboutness to those cases, where there is a significant overlap in terms. We use this basic formalism to define the translation of a model as the first step in our theoretical evaluation. Translation Translation is the symbolic representation of the retrieval model s handling of information. It is described by a function map that maps information to its formal representation using the results of the indexing process. Infons are represented by k, where k stands for any indexed term or other information descriptor. A set of infons is a situation: { k 1, k 2 }. N-ary relationships R between infons i j are themselves infons and are modelled by R,i 1,...,i n. A simple example without relations is{ house, garden }. Later on, we cover a more complex example. Translation is closely linked to building a document information representation through indexing. According to (C. J. v. van Rijsbergen, 2004), with aboutness we come from the concrete notion that descriptors in indexes represent properties of documents. The simple example above demonstrates the translation of a simple document representation as a bag of terms. For XML retrieval, we need to add structural relationships to such simple representations in order to capture the XML structure. We define a translation to capture these later on. Throughout the paper, we use upper case letters from the middle of the alphabet such as S, T for situations. Anything these situations represent like keywords but later on also structured information is symbolized with letters from the beginning of the alphabet like A or B. Operators In addition to the definition of aboutness, we need operators that we can use to relate situations with one another. For instance, situations can be merged to form new situations. With, we formalise the composition of situations, e.g. { house } and { garden } can be combined to { house, garden }. Preclusion, symbolised by, expresses that information in situations clashes. They cannot be meaningfully combined such as { f lying birds, penguins }. If defined at all, preclusion describes mostly semantic relationships (Wong et al., 2001). However, most models have no notion of information clashes. states that two situations are equivalent, i.e. they contain the same information. According to the equivalence relationship, situations should be compared according to the meaning they bear not the names we give them. Lastly, containment describes that a situation contains at least the same information another one has. In Boolean retrieval this corresponds to, e.g., the implication that for any valid expression x y, x is also valid. Rules Aboutness is defined as a relationship between situations. In a theoretical evaluation framework, rules are used to define the reasoning aspects of this relationship. Rules are the logical representation of how a system decides a document to be about a query. The rules do not hold for all aboutness decisions but only for particular ones. Thus, an aboutness decision can be specified by the reasoning rules it incorporates. The aboutness decision can be further qualified by analysing how these reasoning rules are implemented by it: fully, conditionally or not at all. If the rules are fully supported they hold without conditions. If they are conditionally supported, they are generally supported, except for some special cases. For instance, aboutness decisions often do not consider any kind of overlap in a document s information with an information need to be sufficient to decide aboutness but only an information overlap of a particular size. As not all rules hold for all systems and in the same way, we can use them to deliver commonalities and differences between systems. By comparing the rules each system incorporates and the way it does so, we are able to give an overall comparison of the behaviour of retrieval systems. We use a subset of rules given by (Huibers, 1996) and by (Wong et al., 2001) to describe aboutness proof systems. We use those rules that in our experience best describe system performance in XML retrieval 3. These are generally based on the non-monotonic reasoning rules developed in (Kraus, Lehmann, & Magidor, 1990). For a complete analysis of the rules, please compare (Blanke, 2012), where we also motivate our choice of rules based on their effectiveness to explain experimental performance. In this paper, we concentrate for space reasons on the following rules: Reflexivity states that a situation S is about itself: S S. Symmetry states that if situation S is about another situation T (S T ), then also T S. Transitivity makes a transitive conclusion: If S T and T U, then also S U. Left Monotonic Union (LMU) states that if S T, then also S U T. LMU has a variant called Mix, which 3 There is an ongoing debate on which of the rules to use to best functionally describe aboutness systems. For a comparison see (Wong et al., 2001) and (Huibers, 1996).

4 4 adds a second assumption. With Mix, we state that if S U and T U, then also S T U. Right Monotonic Union (RMU) states that if S T, then also S T U. Let us discuss LMU in more detail. Huibers (1996) and Wong et al. (2001) both see the careful consideration of monotonicity as an important feature of IR. In an aboutness model unconditionally supporting LMU, a query containing house is not only about documents with house, but equally valid answers are components with house and garden. This means, aboutness decisions supporting LMU are not affected by document length. For systems supporting RMU, query expansion does not change the aboutness decision. This means that systems with RMU can expand the original query and gain a higher recall base while at the same time not losing what the original query was about. Both document component and query length are decisive aspects of aboutness decisions, which underlines the importance of monotonicity (Wong et al., 2001). The translation and the aboutness rules were developed by Huibers (1996) as parts of the theoretical evaluation of any retrieval model. In XML retrieval, we are in addition interested in how much a retrieval model uses structure to support its aboutness decision. Theoretically, we measure this by determining the qualitative reasoning distance of the XML retrieval model to its flat retrieval model equivalent (if there is one) and what we call pure type XML retrieval, which we introduce in the next section. Pure type XML Retrieval Pure types have been developed by the sociologist Max Weber (see (Weber, 1997 ( ))) and have proven to be useful tools for comparative studies. A pure type describes aspects of phenomena, but is not meant to describe perfect things nor all aspects of any one particular case. It is rather a purposeful emphasis of aspects common to most cases of the observed item. In our case, the emphasis is on the impact of XML structure on the aboutness behaviour. For each model we compare its reasoning behaviour with those of other models and look at its consideration of structure by determining its qualitative reasoning distance to the pure type model and the flat document model, the XML retrieval model it is based upon. The qualitative reasoning distance is described by the differences and commonalities in reasoning properties. The latter part has not been considered yet in aboutness investigations and is particularly useful for the theoretical evaluation of XML retrieval. As the pure type XML retrieval model is an aboutness system in itself, we now go though the steps of an aboutness based theoretical evaluation, as presented in the theoretical framework section, namely, aboutness, translations, operators and rules. Pure type aboutness The definition of aboutness for pure type XML retrieval is directly linked to the INEX view on exhaustivity and specificity, as developed in detail in (Blanke, 2012). As seen in the introduction, exhaustivity describes how far the document component contains all the information in a query, while specificity describes how little it is about other information than the one in the query. We use this description to define pure type aboutness: Say, that a document d is indexed in such a way that its XML structure is preserved. Then, it is about a query q according to the INEX view if structure and content information of q are contained in the structure and content of d. We represent this hierarchical inclusion with. To define, we use structure and content at the same time. Thus, to make this definition work for XML retrieval, we have to give a symbolic representation of situations that includes structure and content. We introduce this representation later on and give a Situation Theory formalism that allows us to represent indexing that preserves XML structure. But first we give two examples that illustrate the definition of hierarchical inclusion. In the following example, a section containing a paragraph about house and garden is generally a relevant answer to a query asking for a paragraph on house: Example paragraph house /paragraph section paragraph house, garden /paragraph /section In the example we use the standard notation for XML elements using tags (Lalmas, 2009) for element types. The paragraph about house and garden is, however, not a particularly specific answer to the same query as it contains additional unwanted information about garden. With hierarchical inclusion, a paragraph just containing house would be a fully specific answer to a query asking for a paragraph on house, as in the following example: Example paragraph house /paragraph section paragraph house /paragraph /section The two examples are both examples of hierarchical inclusion, as the information on the right hand side (document component) of the operator contains the information on the left hand side (query). The examples also illustrate how for hierarchical inclusion, structure is an equal component of aboutness. With pure type XML retrieval, we can compare the impact the XML structure has on the reasoning of different retrieval models. As discussed earlier, most XML retrieval models are extensions of models that work with flat and unstructured documents, for which content alone is important in their reasoning. With the pure type model, we create a reference model, for which XML structure is fully included in aboutness. In the next section we introduce a notation that allows us to use the power of set theory to perform reasoning that includes structure and content. This notation substitutes the direct XML representation of hierarchical inclusion using as seen in the two examples above and allows for simpler derivations of reasoning. Pure type translation For the pure type translation we make the assumption that in pure type XML retrieval the XML structure is preserved

5 FRAMEWORK FOR THE THEORETICAL EVALUATION OF XML RETRIEVAL 5 during indexing. We want to express the translation of a model that uses XML structure in its matching. To define the translation, we reuse the conceptual graph translation, as defined in (Huibers, Ounis, & Chevallet, 1996). Their conceptual graphs map well onto XML trees. The conceptual graph model has been developed in (Sowa, 1992) and analysed from an aboutness point of view by Huibers et al. (1996). In the model, a query q and a document d are both seen as conceptual graphs. The knowledge in the document collection is modelled as conceptual relationships between concept types. Content is found as referents to concept types. For XML, we consider these concept types to be element types, while we limit the set of relationships to the parent and the attribute relationship. Content is found in XML as part of element types. Based on these considerations, we define pure type map next. Definition of Map. In this section, we define a translation that preserves the XML structure. As in (Huibers et al., 1996), we want to represent a graph (in our case an XML tree) as a set of infons in order to preserve the XML structure. Intuitively, we can easily map a hierarchical organisation of information (in an XML tree) to sets, if we consider the parent elements to be the supersets of all sets of information that its children contain. We then also need a way of representing the relationship between parents and children in the same framework. Fortunately for us, we are considering sets of infons, which can either express content or relationships between content in the same formalism. Huibers et al. (1996) develop the approach we are mapping on to XML retrieval and state that a conceptual graph carries information and can be seen as a situation. We say that an XML element carries information and can be seen as a situation. For (Huibers et al., 1996), the conceptual graph situation is constituted of the concepts, referents and relations that define the information the conceptual graph carries. We need to consider instead element types, content in element types and parent and attribute relations. As Huibers et al. (1996) propose to translate each item of a conceptual graph (concept, referent and relation) into a specific infon, we propose to do the same with XML elements. Using element type, content and relation infons, we next define map for pure type XML retrieval. An XML tree consists of XML elements that have element types and associated content, which we refer to as values. These elements are connected with edges. We now suggest to translate XML elements together with their values into set of infons and to distinguish relational, content and element types. Let us assume that d is an XML document. Then, it can be translated into situations by using a map: For each XML element p of d with element type U, map has a result { ElementType,U, p }, where p is the unique parameter. Such an infon is called an element type infon. For each XML element p of d with a type U containing descriptors k 1 to k n, map is {map(u) Value,k 1, p,..., Value,k n, p }. p is the unique parameter that identifies U. k 1... k n is the set of n descriptors (for instance, index terms) that are values of the element type U. We call these infons value infons. is explained later on. Say R is a relation between two XML elements A and B in d. Let E 1 and E 2 be element types and { ElementType,E 1, p } map(a) and { ElementType,E 2,q } map(b). We can then say that map(r(ab))=map(a) { R, p,q } map(b). p is an identifier for E 1 and q for E 2. We call such an infon a relational infon, where its parameters are ordered, so that, for instance, { Parent,i 1,i 2, ElementType,Article, i 1, ElementType,Section,i 2 } reads as: Article is a parent of Section. Let us consider some examples. A paragraph about garden can be expressed (translated) as { ElementType, Paragraph, p, Value, garden, p } using an identifier p. A section with two paragraphs is translated into { ElementType,Section,s, Parent,s, p 1, Parent,s, p 2, ElementType,Paragraph,p 1, Value,garden, p 1, ElementType,Paragraph, p 2, Value,house, p 2 }. We use the relation Parent to express that the two paragraphs are the children of the section. We need to make sure that each XML document corresponds to exactly on set of infons and vice versa. To this end, we sketch a translation production algorithm leaving out the definition of the infon parameters, which can be easily deduced. We traverse the XML tree in a depth-first manner. Each time we visit an XML element we create an element type infon. If we reach a leaf we collect all the descriptors in the leaf and create a value infon for each of them. We then backtrack through the tree and while backtracking we create relational infons to connect the element type infons. Following this algorithm, we create a unique XML situation (set of infons) from each XML document. Furthermore, we can re-create the tree of the XML document bottom up, starting with the value infons to create the leaves and then reconnect the elements by following the relational infons using the element type infons to define the types of the elements. We only allow XML situations (set of XML infons), which lead to a valid XML document according to the definition by the W3C and therefore conform to the rules of a Document Type Definition (DTD) or an XML Schema (XSD) (in our case given by INEX). 4 Pure type operators To perform our aboutness reasoning for pure type XML retrieval, we need to define operators between XML situations. We use the same operators as defined earlier, but change these to make them work for XML situations. We need these operators to perform aboutness reasoning later in this paper. Here, we define equivalence and composition. Containment and preclusion can be defined similarly. 4 This can be proven by running it against the official W3C markup validation service:

6 6 Let us assume that we have two XML situations S and T, and a translation function map. Then: Two XML situations are equivalent if we can rename their identifiers and have an equivalent set, e.g.: ElementType,Section,i 1 ElementType,Section,i 2. S T map(a) map(b), where S map(a), T map(b) and A and B are XML documents. With this pure type translation, structure and content of XML documents come together in the same formalism. This makes it easier for us to draw conclusions about the reasoning incorporated in a specific XML retrieval system. By using to form larger situations, we can now employ the power of set theory to derive the aboutness reasoning, which allows us to reuse earlier results developed in (Huibers, 1996). Rules Now we show the reasoning rules of pure type XML retrieval. We first introduce an auxiliary proposition that greatly simplifies the proofs needed to analyze the pure type s reasoning. The proposition is based on our definitions for translation and aboutness decision and Huibers analysis of conceptual graphs in (Huibers, 1996): Proposition 1 For XML document components A and B, B A if and if only (or iff) map(a) map(b). Proof : Let B A. Then, we know that B has the content and the structure of a subdocument of A. Using the algorithm from page 5, we construct two situations S map(a) and T map(b). This means all relational, element types and value infons from T are also in S. Thus, map(a) map(b) using the parameter renaming defined in (Huibers et al., 1996). : Let map(a) map(b). Then, we know that map(b) has only infons also found in map(a). If we apply the algorithm to transform situations into XML documents from as shown on page 5, B needs to have the structure and content of a subdocument of A and is therefore hierarchically included in it: B A. The aim of our translation was to use the power of set theory in the aboutness proof. Proposition 1 verifies that we can do aboutness proofs with a relatively simple set operation on the representation (as situations) of XML trees. Using Proposition 1, we can now easily show that Reflexivity is given, as map(a) map(a) with S map(a). Transitivity holds. If S T and T U, then also S U. Say, that S map(a), T map(b), and U map(c). Then, map(a) map(b) as well as map(b) map(c). Thus, map(a) map(c). Symmetry is not given. From S T, we do not derive that T S. Again, S map(a) and T map(b). Then, map(a) map(b) is not equivalent to map(b) map(a). Now, we demonstrate that the aboutness system of pure type retrieval fully supports LMU. Given the assumption that a situation S is about another situation T (S T ), LMU offers the conclusion that also S U T, where U is a situation. Let us assume, that S map(a), T map(b) and S U map(c). The premise of LMU can then be rewritten as B A, which is implied by map(a) map(b) according to Proposition 1. With the definition of the translation, we also have map(c) map(a). Therefore: map(c) map(b). Thus, according to Proposition 1 also C B and LMU is fully supported. Mix is a special case of LMU and thus given, too. RMU, on the other hand, does not hold. Indeed, from S T, we cannot conclude S T U. In the next section, we use the pure type aboutness system to find out what role structure plays in actual XML retrieval models, such as those developed and evaluated during INEX. Theoretical evaluation of XML Retrieval Models In this section, we apply our proposed theoretical evaluation framework to evaluate XML retrieval models experimented with during INEX. We do not fully specify all the properties of these models, instead we concentrate on demonstrating important aspects of our theoretical evaluation approach that help us explain experimental behaviour. We begin with a model that builds upon existing flat document retrieval strategies XML language model. Afterwards, we investigate a model specifically designed for XML retrieval Gardens Point. For these two models, we look at the corresponding translation into formal situations, the reasoning rules, and the relationship to pure type XML retrieval. In a full theoretical evaluation, we would cover over 20 rules but for the purpose of this paper, we focus on Transitivity, Reflexivity, Symmetry and Monotonicity. These often form the basis of a theoretical evaluation, as we have shown in (Blanke & Lalmas, 2007). In related work (Blanke, 2012), we could identify these rules as those that have the stronguest impact on the experimental behaviour in INEX. The two models, on the other hand, have been chosen because of their behaviour in the experimental task we investigate later on. In (Blanke & Lalmas, 2011), we look at the performance of models in another INEX task, while Blanke (2012) offers the evaluation of five highly performing models in INEX (in terms of effectiveness). Language Models An XML retrieval model that is based on a model for flat document retrieval and that performed well at INEX is the language modelling described in (Sigurbjörnsson & Kamps, 2005). We refer to this model as the XML language modelling. A language model for each document component is calculated by interpolating the element (P mle (t i e)), the document (P mle (t i d)) and the collection (P mle (t i )) language models: P(t i e) = λ e P mle (t i e)+λ d P mle (t i d)+(1 λ e λ d ) P mle (t i ). This model is built on the decision that an element d is about a query q if and if only the information in q can be found in the element (its representation). We actually have several aboutness decisions depending on which elements are indexed, e.g. those above a given size, those that correspond

7 FRAMEWORK FOR THE THEORETICAL EVALUATION OF XML RETRIEVAL 7 to types frequently assessed as relevant, or some types of elements only. We therefore must provide a map function to translate infons for each chosen approach. We only demonstrate 2 of them, as the others are built very similarly. Let A be an element with term t and type e: Length based approach: map length (A) { ElementType, e, i, Value, t, i A > κ} where κ is a length threshold. Section based approach: map sec (A) { ElementType,e,i, Value,t,i e {Sec}} The main difference to a flat document language model is the division into document components instead of documents. This XML language model retrieval aboutness decision is the same as for the flat document language model: d about q if and if only P(t i e) > θ. In (Blanke, 2012), details of the threshold θ s dependency on the query are discussed. For now, it is enough to say that θ is based on the smoothing value, which is in any language model the lowest possible value for an element without query terms. It is internal to the aboutness decision, as it is dependent on the overall distribution of the terms in the collection. This allows the model to be adjusted well to specific collections (in our case, those used at INEX). We now continue with the analysis of the reasoning rules supported by our XML language modelling. To prove the reasoning rule, we use the proposition that P(t i e) > θ if there is an overlap in information between document component and query. We omit the proof here. Huibers (1996) demonstrated that retrieval systems based on this proposition, generally support Reflexivity, Symmetry and Transitivity reasoning. We therefore only demonstrate LMU. Say, that S map(a), T map(b) and S U map(c). Then, we have C B / /0 and C A. Thus, A B / /0, and LMU is unconditionally supported. Mix is a special case of Left Monotonic Union and is therefore also supported. Similarly to Left Monotonic Union, only those situations can be combined, which are part of the same index. This is interesting, as for XML retrieval parents and children are about the same queries and Mix should therefore be an automatic property, because it extends Left Monotonic Union. However, this is not the case for this model, where children and parents can be part of distinct indexes. This language modelling approach for XML retrieval is distinctively different from our pure type XML retrieval model. Structure is included in the aboutness decision by a priori dividing elements into several different indexes and not part of the aboutness reasoning itself. We will see in the section on the experimental evaluation results how this expresses itself in the experimental evaluation results. Gardens Point The Gardens Point Model (Geva, 2005) is a model that was specifically designed to fit the requirements of XML retrieval, by discriminating aboutness for leaf and branch elements. A leaf element is considered to be about a query if it contains at least one query term. A branch element is about a query if its subtree contains at least one leaf element that is about the query. A document component d is about a query q if rsv(d,q)>0. For leaf elements L, the scoring function is defined as rsv L = K n 1 n i=1 t i fi. Here, n is the number of unique query terms, t i is the frequency of the i-th query term in the leaf element and f i its collection frequency. Thus, rsv L favours document components with many unique query terms while penalizing query terms frequent in the collection. The weights of the leaf elements are propagated to form the weights for the branch elements. For a branch element B, the scoring function is given by rsv B = D(c) c i=1 rsv L i. c stands for the number of retrieved children elements. A decay factor D(c) is used to control the propagation, where D(c) = 0.49 for c=1 and D(c)=0.99 otherwise. In the Gardens Point model, the location of each term is identified by an absolute XPath expression (Geva, 2005). This means that the complete XML tree structure is preserved in the element representation. The translation is therefore the same as in the pure type XML retrieval model. However, hierarchical inclusion is not implemented, because Proposition 1 is not given. rsv(d, q) > 0 does not mean map(d) map(q), as the following example demonstrates. In the Gardens Point model, a leaf element containing house and garage is about a query asking for house. But according to Proposition 1, { ElementType,Paragraph,i 1, Value,House,i 1, Value,Garage,i 1 } is not about { ElementType, Paragraph,i 1, Value,House,i 1 }, as it has additional information about garages. According to (Geva, 2005), each term in an XML document is identified by three elements in the index: file path, absolute XPath context and term position within the XPath context. As a query language, however, Gardens Point uses NEXI (Geva, 2005), which is an (enhanced) subset of XPath. Content-only (Geva, 2005) queries are expressed as a search over the entire article element using NEXI. This leads to asymmetric reasoning behaviour. Say, we have a query //article[about(house)] that is about an element article[1]/bm[1]/bib[1]/bibl[1]/bb[13]/pp[1]. Then, we cannot use article[1]/bm[1]/bib[1]/ bibl[1]/bb[13]/pp[1] as a query again, as it is not a valid query expression. Thus, Gardens Point does not support the Symmetry rule. LMU would be a property of the aboutness systems, if with S T we could derive that S U T. Let us assume that S map(a), T map(b) and S U map(c). Thus, rsv(a,b) > 0. We are able to then also say that rsv(a,c) > 0. The sum in the calculation for leaf elements stay at least the same when adding new information. Sums in relevance calculation generally promote monotonic behaviour (Blanke & Lalmas, 2006). RMU on the other hand is not given, which again shows how close Gardens Point is to pure type XML retrieval. Elsewhere (Blanke, 2012), we have done a complete comparison of the reasoning properties with the pure type model and a flat document model derived from an aboutness decision purely based on rsv L. This comparison reveals that the model holds for almost the same set of rules as the pure type

8 8 model does. In our theoretical evaluation of XML retrieval models (limited to two models in this paper), we could see how XML retrieval models are concerned with controlling monotonic behaviour as well as other reasoning that support good performance in XML retrieval. As an example, we have discussed Symmetry. GPX does also not support right monotonic reasoning (RMU). RMU does not necessarily support better retrieval results. RMU allows us to conclude from the assumption S T that also S T U. However, in XML retrieval T might well include a structural condition. For instance, T alone might point to a section while T U might point to a paragraph within a section, which would completely change the aboutness relation. It is this kind of desirable behaviour that implies that XML retrieval systems should be able to change an aboutness decision if the XML context changes. The description of the monotonic reasoning behavior of XML retrieval models is key to the distinction with flat document retrieval. GPX compared to XML language modelling is closer to pure type XML retrieval especially in terms of the monotonic reasoning behaviour. This is one reason for its convincing behaviour in the experimental evaluation and why it outperforms XML language modelling in many experimental evaluation conditions. Theoretical Analysis of the INEX Experimental Evaluation Results In this section, we show how our proposed evaluation framework for XML retrieval is complementary to experimental results obtained at INEX, on the so-called thorough retrieval task for content-only (CO) queries (Gövert et al., 2006). We have concentrated on the INEX 2005 campaign because since then the fundamentals of the models for that task have not changed much. The thorough task is concerned with retrieving relevant elements for a given query, and to rank them according to their estimated relevance to that query. The thorough task is to be contrasted with the focused task, which is concerned with returning a list of nonoverlapping elements as answers to a given query. Overlap occurs when a document component (e.g. a section) and one of its descendent (e.g. a paragraph in this section) or ascendent (e.g. the chapter containing that section) are both returned as answers. In (Blanke & Lalmas, 2011), we presented a theoretical framework, also based on Situation Theory, to evaluate the focused task. The framework included a new theoretical evaluation methodology to evaluate filters, commonly used in INEX to focus the retrieval results on only the most specific XML elements. In this paper, we are interested in the underlying task to retrieve relevant XML elements in general. We proceed as follows. In the following parts, we introduce first some of the background of the effectiveness measures used to evaluate the thorough task. We analyse which of the aboutness reasoning properties promise to deliver good results under these particular measures. Afterwards, we use this analysis to explain experimental outcomes for the two models we have theoretically evaluated. Aboutness reasoning behind the INEX metrics We concentrate on the effort-precision/ gain-recall (ep gr) metrics used to evaluate the thorough task. Effortprecision (Kazai & Lalmas, 2006) is based on the amount of relative effort that a user has to make following a ranking of a system compared to the effort an ideal ranking would take. Though other measures are used by now in INEX, these measures were used during INEX 2005, and our theoretical framework is used to explain experimental results for that same year. Effort-precision ep is defined in (Kazai & Lalmas, 2006) by ep[r] = i ideal i run. i ideal is measured as the rank position at which the cumulated gain of r is reached by the ideal curve of the ranking. i run is measured by the same rank position in the real run. Let us assume that that three returned elements i 1, i 2 and i 3 have relevance scores of 2, 1, and 0, respectively. In an ideal system we would have the following ranking: {i 1,i 2,i 3 }. Assume that our system however returns {i 3,i 1,i 2 }. The cumulated gain of 2 would be reached by the ideal system at rank 1, while our system delivers it only at rank 3. This means ep[r] = 1/3. This demonstrates that ep[r] = 1 indicates a perfect performance and that scores are always between 0 and 1, as the ideal cumulated gain is always higher than or equal to the actual cumulated gain. In 2005, ep is captured at gain-recall points gr (Kazai & Lalmas, 2006), where gain-recall is calculated as the cumulated gain value divided by the total achievable cumulated gain. ep gr therefore measures where the most relevant XML elements are found in the ranking produced by a system. The more they are concentrated in the tail of the result list, the worse the performance. Secondly, ep gr measures how completely the set of most relevant elements is represented at the top of the ranking. A small number of highly relevant elements that appear at the tail of the ranking have a much stronger impact on the performance than, for instance, highly irrelevant elements at the head of the ranking list. This is the case as the denominator, which dominates ep[r], is the actual ranking result. If the actual ranking is much worse than the ideal ranking, ep[r] becomes very small. For a further discussion on the relationship between the cumulative gain measure and other standard information retrieval measure, readers are referred to (Carterette & Voorhees, 2011) and (McSherry & Najork, 2008). To understand how systems can develop a ranking that has at its top only the most relevant elements, we need to look at how systems can preserve aboutness. Our assumption is that those elements that are about a query are related in the information they have. Then, related relevant XML elements differ in how much relevant information they contain. According to our reasoning rules, a relevant element that has at least the same relevant information or more than another element can be found by either applying Left Monotonic Union or Mix reasoning. As defined in the section on the aboutness rules, LMU concludes that S U T, given S T. With respect to Mix, we can from the assumptions S T and

9 FRAMEWORK FOR THE THEORETICAL EVALUATION OF XML RETRIEVAL 9 Table 1 INEX thorough task evaluated using ep gr. Rank Model Run MAep 4 Language Model LMQrelbasedIndex Language Model LMLengthbasedIndex Language Model LMElementIndex Gardens Point GPX-1-Thorough Gardens Point GPX-2-Thorough U T conclude that also S U T. Looking at LMU, we already know from our discussions from the rules section, that the added information U does not necessarily have to be about the query. Systems, which support LMU, have therefore problems returning only highly relevant elements early in the ranking. Mix, however, provides a more conservative approach to monotonicity. Contrary to LMU, the added information is also about query. Therefore, Mix supports better performance in ep gr, while with LMU we need to be more careful. For all XML retrieval results under ep gr, we next look at how these left monotonic reasoning properties explain some of the experimental outcomes. Theoretical Analysis of Experimental Results Table 1 presents the results for the models discussed in this paper evaluated at INEX 2005 with ep gr. The rank column is the overall ranking of the model, named in the second column. The run is one of the submissions of the model, which we will describe later on. MAep stands for Mean Average effort precision, which is calculated by averaging the effortprecision values whenever a relevant XML element is found in the ranking (Fuhr, Lalmas, Malik, & Kazai, 2006). We can clearly see how the analysed models perform for these particular tasks. Generally speaking, both models perform well. Next, we explain these good performances and differences in the different runs submitted to INEX for these models. We rely on the left-monotonic reasoning rules, we have introduced. With respect to ep gr, XML language modelling (Sigurbjörnsson & Kamps, 2005) looks at reducing the number of indexing units with two special indexes only: one based on element length (run referred to as UAmsCOT LengthbasedIndex in Table 1) and another based on past relevance assessments Q rel (run referred to as UAmsCOT QrelbasedIndex in Table 1). These two indexes perform particularly well and come 4th and 5th in the overall assessment ranking (see Table 1). LMU is fully supported by the language modelling approach. This means that the aboutness decision is preserved across elements that share the same relevant information with smaller elements but also contain other information. This good performance through LMU reasoning is further fostered by the fact that Mix is fully supported, too. Looking at LMU, it is also no surprise that Q rel is the model s best submitted run with some distance (see Table 1). Here, the side effect of unwanted information added by LMU reasoning occurs less likely, as Q rel contains only those elements that according to the experience of previous INEX years are more likely than others to be regarded as relevant. The overall performance of Gardens Point is relatively worse for the thorough task (Table 1) than for others in INEX 2005 (Kazai & Lalmas, 2006). Though often the bestperforming model in INEX 2005, here it is outperformed by several other retrieval systems, although its overall performance is still very good (it still ranked among the 10 top models). According to our analysis of Gardens Point, the model differs among other things from language modelling and other models in its decay factor D(c). We can explain the model s experimental behaviour with the impact of D(c) on LMU and Mix. D(c) aims to control the impact of parent elements by putting more relevance onto the relevant children. However, these parent elements might have a better score and should therefore appear at the top according to the ideal ranking. If they do not appear at the top in the run, they reduce the overall performance significantly (as observed when using the measure ep gr). Using Mix, we can demonstrate this behaviour. Let us assume that we have a component S with a score of 3 for a query T and a component U with a score of 2. Then according to Mix, with S T and U T also their parent S U T. Let us further assume that without D(c) the score of S U would be 5, with D(c) it is 2.45, which reduces its rank in the actual run, increasing its distance from the ideal rank and therefore decreasing ep[r]= i ideal i run. A similar argument can be made using LMU twice, considering a highly relevant grandchild and child of a parent. As it reduces the scores of parents with highly relevant children, it is not surprising that Gardens Point s performance decreases. In summary, we used our theoretical evaluation results to compare the experimental ones for XML retrieval in order to find out how the adjustment of an existing flat document retrieval models compares to the creation of completely new one, especially designed to meet the requirements of XML retrieval. We analysed reasoning properties that support good performance in the thorough retrieval task under ep gr and determined reasoning properties that supported good performance for this task. The monotonic reasoning rules have played an important role here, as already seen for flat document retrieval. Conclusions and Discussions This paper presented a theoretical aboutness approach for an evaluation of XML retrieval models. To this end, we amended existing approaches and expanded the aboutness system to demonstrate the reasoning properties and under which conditions these properties are supported. Each of our theoretical evaluations goes through the same steps to define the characteristics of a particular XML retrieval model: translation and aboutness definition, analysis of rules and comparison with pure type XML retrieval. We proposed the

Specificity Aboutness in XML Retrieval

Specificity Aboutness in XML Retrieval Specificity Aboutness in XML Retrieval Tobias Blanke and Mounia Lalmas Department of Computing Science, University of Glasgow tobias.blanke@dcs.gla.ac.uk mounia@acm.org Abstract. This paper presents a

More information

The Utrecht Blend: Basic Ingredients for an XML Retrieval System

The Utrecht Blend: Basic Ingredients for an XML Retrieval System The Utrecht Blend: Basic Ingredients for an XML Retrieval System Roelof van Zwol Centre for Content and Knowledge Engineering Utrecht University Utrecht, the Netherlands roelof@cs.uu.nl Virginia Dignum

More information

Mounia Lalmas, Department of Computer Science, Queen Mary, University of London, United Kingdom,

Mounia Lalmas, Department of Computer Science, Queen Mary, University of London, United Kingdom, XML Retrieval Mounia Lalmas, Department of Computer Science, Queen Mary, University of London, United Kingdom, mounia@acm.org Andrew Trotman, Department of Computer Science, University of Otago, New Zealand,

More information

Focussed Structured Document Retrieval

Focussed Structured Document Retrieval Focussed Structured Document Retrieval Gabrialla Kazai, Mounia Lalmas and Thomas Roelleke Department of Computer Science, Queen Mary University of London, London E 4NS, England {gabs,mounia,thor}@dcs.qmul.ac.uk,

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Using XML Logical Structure to Retrieve (Multimedia) Objects

Using XML Logical Structure to Retrieve (Multimedia) Objects Using XML Logical Structure to Retrieve (Multimedia) Objects Zhigang Kong and Mounia Lalmas Queen Mary, University of London {cskzg,mounia}@dcs.qmul.ac.uk Abstract. This paper investigates the use of the

More information

Processing Structural Constraints

Processing Structural Constraints SYNONYMS None Processing Structural Constraints Andrew Trotman Department of Computer Science University of Otago Dunedin New Zealand DEFINITION When searching unstructured plain-text the user is limited

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 13 Structured Text Retrieval with Mounia Lalmas Introduction Structuring Power Early Text Retrieval Models Evaluation Query Languages Structured Text Retrieval, Modern

More information

Formulating XML-IR Queries

Formulating XML-IR Queries Alan Woodley Faculty of Information Technology, Queensland University of Technology PO Box 2434. Brisbane Q 4001, Australia ap.woodley@student.qut.edu.au Abstract: XML information retrieval systems differ

More information

Consistency and Set Intersection

Consistency and Set Intersection Consistency and Set Intersection Yuanlin Zhang and Roland H.C. Yap National University of Singapore 3 Science Drive 2, Singapore {zhangyl,ryap}@comp.nus.edu.sg Abstract We propose a new framework to study

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

The Interpretation of CAS

The Interpretation of CAS The Interpretation of CAS Andrew Trotman 1 and Mounia Lalmas 2 1 Department of Computer Science, University of Otago, Dunedin, New Zealand andrew@cs.otago.ac.nz, 2 Department of Computer Science, Queen

More information

Binary Decision Diagrams

Binary Decision Diagrams Logic and roof Hilary 2016 James Worrell Binary Decision Diagrams A propositional formula is determined up to logical equivalence by its truth table. If the formula has n variables then its truth table

More information

Propositional Logic. Part I

Propositional Logic. Part I Part I Propositional Logic 1 Classical Logic and the Material Conditional 1.1 Introduction 1.1.1 The first purpose of this chapter is to review classical propositional logic, including semantic tableaux.

More information

THE weighting functions of information retrieval [1], [2]

THE weighting functions of information retrieval [1], [2] A Comparative Study of MySQL Functions for XML Element Retrieval Chuleerat Jaruskulchai, Member, IAENG, and Tanakorn Wichaiwong, Member, IAENG Abstract Due to the ever increasing information available

More information

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson

XML RETRIEVAL. Introduction to Information Retrieval CS 150 Donald J. Patterson Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Manning, Raghavan, and Schütze http://www.informationretrieval.org OVERVIEW Introduction Basic XML Concepts Challenges

More information

The Chinese University of Hong Kong, Shatin, N.T., Hong Kong 2 Distributed Systems Technology Center, Building 78, Staff House Road

The Chinese University of Hong Kong, Shatin, N.T., Hong Kong 2 Distributed Systems Technology Center, Building 78, Staff House Road From: FLIRS-00 Proceedings. opyright 2000, I (www.aaai.org). ll rights reserved. Fundamental Properties of the ore Matching Functions for Information Retrieval D.W. Song 1 K.F. Wong 1 P.D. ruza 2.H. heng

More information

Articulating Information Needs in

Articulating Information Needs in Articulating Information Needs in XML Query Languages Jaap Kamps, Maarten Marx, Maarten de Rijke and Borkur Sigurbjornsson Gonçalo Antunes Motivation Users have increased access to documents with additional

More information

Phrase Detection in the Wikipedia

Phrase Detection in the Wikipedia Phrase Detection in the Wikipedia Miro Lehtonen 1 and Antoine Doucet 1,2 1 Department of Computer Science P. O. Box 68 (Gustaf Hällströmin katu 2b) FI 00014 University of Helsinki Finland {Miro.Lehtonen,Antoine.Doucet}

More information

Accessing XML documents: The INEX initiative. Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen)

Accessing XML documents: The INEX initiative. Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen) Accessing XML documents: The INEX initiative Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen) XML documents Book Chapters Sections World Wide Web This is only only another

More information

Passage Retrieval and other XML-Retrieval Tasks. Andrew Trotman (Otago) Shlomo Geva (QUT)

Passage Retrieval and other XML-Retrieval Tasks. Andrew Trotman (Otago) Shlomo Geva (QUT) Passage Retrieval and other XML-Retrieval Tasks Andrew Trotman (Otago) Shlomo Geva (QUT) Passage Retrieval Information Retrieval Information retrieval (IR) is the science of searching for information in

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

The Heterogeneous Collection Track at INEX 2006

The Heterogeneous Collection Track at INEX 2006 The Heterogeneous Collection Track at INEX 2006 Ingo Frommholz 1 and Ray Larson 2 1 University of Duisburg-Essen Duisburg, Germany ingo.frommholz@uni-due.de 2 University of California Berkeley, California

More information

CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Dan Grossman Spring 2011

CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Dan Grossman Spring 2011 CS152: Programming Languages Lecture 11 STLC Extensions and Related Topics Dan Grossman Spring 2011 Review e ::= λx. e x e e c v ::= λx. e c τ ::= int τ τ Γ ::= Γ, x : τ (λx. e) v e[v/x] e 1 e 1 e 1 e

More information

Reading 1 : Introduction

Reading 1 : Introduction CS/Math 240: Introduction to Discrete Mathematics Fall 2015 Instructors: Beck Hasti and Gautam Prakriya Reading 1 : Introduction Welcome to CS 240, an introduction to discrete mathematics. This reading

More information

The Effect of Structured Queries and Selective Indexing on XML Retrieval

The Effect of Structured Queries and Selective Indexing on XML Retrieval The Effect of Structured Queries and Selective Indexing on XML Retrieval Börkur Sigurbjörnsson 1 and Jaap Kamps 1,2 1 ISLA, Faculty of Science, University of Amsterdam 2 Archives and Information Studies,

More information

Inheritance Metrics: What do they Measure?

Inheritance Metrics: What do they Measure? Inheritance Metrics: What do they Measure? G. Sri Krishna and Rushikesh K. Joshi Department of Computer Science and Engineering Indian Institute of Technology Bombay Mumbai, 400 076, India Email:{srikrishna,rkj}@cse.iitb.ac.in

More information

Semantics via Syntax. f (4) = if define f (x) =2 x + 55.

Semantics via Syntax. f (4) = if define f (x) =2 x + 55. 1 Semantics via Syntax The specification of a programming language starts with its syntax. As every programmer knows, the syntax of a language comes in the shape of a variant of a BNF (Backus-Naur Form)

More information

CADIAL Search Engine at INEX

CADIAL Search Engine at INEX CADIAL Search Engine at INEX Jure Mijić 1, Marie-Francine Moens 2, and Bojana Dalbelo Bašić 1 1 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, 10000 Zagreb, Croatia {jure.mijic,bojana.dalbelo}@fer.hr

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

A Voting Method for XML Retrieval

A Voting Method for XML Retrieval A Voting Method for XML Retrieval Gilles Hubert 1 IRIT/SIG-EVI, 118 route de Narbonne, 31062 Toulouse cedex 4 2 ERT34, Institut Universitaire de Formation des Maîtres, 56 av. de l URSS, 31400 Toulouse

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Evaluation Metrics. Jovan Pehcevski INRIA Rocquencourt, France

Evaluation Metrics. Jovan Pehcevski INRIA Rocquencourt, France Evaluation Metrics Jovan Pehcevski INRIA Rocquencourt, France jovan.pehcevski@inria.fr Benjamin Piwowarski Yahoo! Research Latin America bpiwowar@yahoo-inc.com SYNONYMS Performance metrics; Evaluation

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

A Universal Model for XML Information Retrieval

A Universal Model for XML Information Retrieval A Universal Model for XML Information Retrieval Maria Izabel M. Azevedo 1, Lucas Pantuza Amorim 2, and Nívio Ziviani 3 1 Department of Computer Science, State University of Montes Claros, Montes Claros,

More information

On k-dimensional Balanced Binary Trees*

On k-dimensional Balanced Binary Trees* journal of computer and system sciences 52, 328348 (1996) article no. 0025 On k-dimensional Balanced Binary Trees* Vijay K. Vaishnavi Department of Computer Information Systems, Georgia State University,

More information

An approach to the model-based fragmentation and relational storage of XML-documents

An approach to the model-based fragmentation and relational storage of XML-documents An approach to the model-based fragmentation and relational storage of XML-documents Christian Süß Fakultät für Mathematik und Informatik, Universität Passau, D-94030 Passau, Germany Abstract A flexible

More information

A probabilistic description-oriented approach for categorising Web documents

A probabilistic description-oriented approach for categorising Web documents A probabilistic description-oriented approach for categorising Web documents Norbert Gövert Mounia Lalmas Norbert Fuhr University of Dortmund {goevert,mounia,fuhr}@ls6.cs.uni-dortmund.de Abstract The automatic

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

The overlap problem in content-oriented XML retrieval evaluation

The overlap problem in content-oriented XML retrieval evaluation The overlap problem in content-oriented XML retrieval evaluation Gabriella Kazai Queen Mary University of London London, E1 4NS UK gabs@dcs.qmul.ac.uk Mounia Lalmas Queen Mary University of London London,

More information

3.4 Deduction and Evaluation: Tools Conditional-Equational Logic

3.4 Deduction and Evaluation: Tools Conditional-Equational Logic 3.4 Deduction and Evaluation: Tools 3.4.1 Conditional-Equational Logic The general definition of a formal specification from above was based on the existence of a precisely defined semantics for the syntax

More information

Exploiting Index Pruning Methods for Clustering XML Collections

Exploiting Index Pruning Methods for Clustering XML Collections Exploiting Index Pruning Methods for Clustering XML Collections Ismail Sengor Altingovde, Duygu Atilgan and Özgür Ulusoy Department of Computer Engineering, Bilkent University, Ankara, Turkey {ismaila,

More information

The Structure of Bull-Free Perfect Graphs

The Structure of Bull-Free Perfect Graphs The Structure of Bull-Free Perfect Graphs Maria Chudnovsky and Irena Penev Columbia University, New York, NY 10027 USA May 18, 2012 Abstract The bull is a graph consisting of a triangle and two vertex-disjoint

More information

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML

Plan for today. CS276B Text Retrieval and Mining Winter Vector spaces and XML. Text-centric XML retrieval. Vector spaces and XML CS276B Text Retrieval and Mining Winter 2005 Plan for today Vector space approaches to XML retrieval Evaluating text-centric retrieval Lecture 15 Text-centric XML retrieval Documents marked up as XML E.g.,

More information

Reliability Tests for the XCG and inex-2002 Metrics

Reliability Tests for the XCG and inex-2002 Metrics Reliability Tests for the XCG and inex-2002 Metrics Gabriella Kazai 1, Mounia Lalmas 1, and Arjen de Vries 2 1 Dept. of Computer Science, Queen Mary University of London, London, UK {gabs, mounia}@dcs.qmul.ac.uk

More information

Optimization I : Brute force and Greedy strategy

Optimization I : Brute force and Greedy strategy Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean

More information

Core Membership Computation for Succinct Representations of Coalitional Games

Core Membership Computation for Succinct Representations of Coalitional Games Core Membership Computation for Succinct Representations of Coalitional Games Xi Alice Gao May 11, 2009 Abstract In this paper, I compare and contrast two formal results on the computational complexity

More information

Similarity search in multimedia databases

Similarity search in multimedia databases Similarity search in multimedia databases Performance evaluation for similarity calculations in multimedia databases JO TRYTI AND JOHAN CARLSSON Bachelor s Thesis at CSC Supervisor: Michael Minock Examiner:

More information

6. Relational Algebra (Part II)

6. Relational Algebra (Part II) 6. Relational Algebra (Part II) 6.1. Introduction In the previous chapter, we introduced relational algebra as a fundamental model of relational database manipulation. In particular, we defined and discussed

More information

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data

Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data American Journal of Applied Sciences (): -, ISSN -99 Science Publications Designing and Building an Automatic Information Retrieval System for Handling the Arabic Data Ibrahiem M.M. El Emary and Ja'far

More information

Composability Test of BOM based models using Petri Nets

Composability Test of BOM based models using Petri Nets I. Mahmood, R. Ayani, V. Vlassov and F. Moradi 7 Composability Test of BOM based models using Petri Nets Imran Mahmood 1, Rassul Ayani 1, Vladimir Vlassov 1, and Farshad Moradi 2 1 Royal Institute of Technology

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

A model of navigation history

A model of navigation history A model of navigation history Connor G. Brewster Alan Jeffrey August, 6 arxiv:68.5v [cs.se] 8 Aug 6 Abstract: Navigation has been a core component of the web since its inception: users and scripts can

More information

1. Introduction. Queensland University of Technolngy, Brisbane, QLD, Anstralia bruza~icis.qut.edu.au. Abstract

1. Introduction. Queensland University of Technolngy, Brisbane, QLD, Anstralia bruza~icis.qut.edu.au. Abstract Towards Functional Benchmarking of Information Retrieval Models From: Proceedings of the Twelfth International FLAIRS Conference. Copyright 1999, AAAI (www.aaai.org). All rights reserved. D.W. Song I K.F.

More information

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees. Chapter 7 SUPERTREE ALGORITHMS FOR NESTED TAXA Philip Daniel and Charles Semple Abstract: Keywords: Most supertree algorithms combine collections of rooted phylogenetic trees with overlapping leaf sets

More information

Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database

Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database Jovan Pehcevski, James Thom, Anne-Marie Vercoustre To cite this version: Jovan Pehcevski, James Thom, Anne-Marie Vercoustre.

More information

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4

Query Languages. Berlin Chen Reference: 1. Modern Information Retrieval, chapter 4 Query Languages Berlin Chen 2005 Reference: 1. Modern Information Retrieval, chapter 4 Data retrieval Pattern-based querying The Kinds of Queries Retrieve docs that contains (or exactly match) the objects

More information

1 Leaffix Scan, Rootfix Scan, Tree Size, and Depth

1 Leaffix Scan, Rootfix Scan, Tree Size, and Depth Lecture 17 Graph Contraction I: Tree Contraction Parallel and Sequential Data Structures and Algorithms, 15-210 (Spring 2012) Lectured by Kanat Tangwongsan March 20, 2012 In this lecture, we will explore

More information

Efficient pebbling for list traversal synopses

Efficient pebbling for list traversal synopses Efficient pebbling for list traversal synopses Yossi Matias Ely Porat Tel Aviv University Bar-Ilan University & Tel Aviv University Abstract 1 Introduction 1.1 Applications Consider a program P running

More information

Evaluating the eectiveness of content-oriented XML retrieval methods

Evaluating the eectiveness of content-oriented XML retrieval methods Evaluating the eectiveness of content-oriented XML retrieval methods Norbert Gövert (norbert.goevert@uni-dortmund.de) University of Dortmund, Germany Norbert Fuhr (fuhr@uni-duisburg.de) University of Duisburg-Essen,

More information

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 27-1 Slide 27-1 Chapter 27 XML: Extensible Markup Language Chapter Outline Introduction Structured, Semi structured, and Unstructured Data. XML Hierarchical (Tree) Data Model. XML Documents, DTD, and XML Schema.

More information

Random Oracles - OAEP

Random Oracles - OAEP Random Oracles - OAEP Anatoliy Gliberman, Dmitry Zontov, Patrick Nordahl September 23, 2004 Reading Overview There are two papers presented this week. The first paper, Random Oracles are Practical: A Paradigm

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Heading-Based Sectional Hierarchy Identification for HTML Documents

Heading-Based Sectional Hierarchy Identification for HTML Documents Heading-Based Sectional Hierarchy Identification for HTML Documents 1 Dept. of Computer Engineering, Boğaziçi University, Bebek, İstanbul, 34342, Turkey F. Canan Pembe 1,2 and Tunga Güngör 1 2 Dept. of

More information

Semantic Tableau Prover for Modal Logic

Semantic Tableau Prover for Modal Logic University of Manchester Third Year Project 2015 Semantic Tableau Prover for Modal Logic Student: Silviu Adi Pruteanu Supervisor: Dr. Renate Schmidt 1/34 Abstract This project report documents the extension

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach

Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Category Theory in Ontology Research: Concrete Gain from an Abstract Approach Markus Krötzsch Pascal Hitzler Marc Ehrig York Sure Institute AIFB, University of Karlsruhe, Germany; {mak,hitzler,ehrig,sure}@aifb.uni-karlsruhe.de

More information

Score Region Algebra: Building a Transparent XML-IR Database

Score Region Algebra: Building a Transparent XML-IR Database Vojkan Mihajlović Henk Ernst Blok Djoerd Hiemstra Peter M. G. Apers Score Region Algebra: Building a Transparent XML-IR Database Centre for Telematics and Information Technology (CTIT) Faculty of Electrical

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Adding Term Weight into Boolean Query and Ranking Facility to Improve the Boolean Retrieval Model

Adding Term Weight into Boolean Query and Ranking Facility to Improve the Boolean Retrieval Model Adding Term Weight into Boolean Query and Ranking Facility to Improve the Boolean Retrieval Model Jiayi Wu University of Windsor There are two major shortcomings of the Boolean Retrieval Model that has

More information

Chapter S:V. V. Formal Properties of A*

Chapter S:V. V. Formal Properties of A* Chapter S:V V. Formal Properties of A* Properties of Search Space Graphs Auxiliary Concepts Roadmap Completeness of A* Admissibility of A* Efficiency of A* Monotone Heuristic Functions S:V-1 Formal Properties

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Fully dynamic algorithm for recognition and modular decomposition of permutation graphs

Fully dynamic algorithm for recognition and modular decomposition of permutation graphs Fully dynamic algorithm for recognition and modular decomposition of permutation graphs Christophe Crespelle Christophe Paul CNRS - Département Informatique, LIRMM, Montpellier {crespell,paul}@lirmm.fr

More information

Chapter 3. Set Theory. 3.1 What is a Set?

Chapter 3. Set Theory. 3.1 What is a Set? Chapter 3 Set Theory 3.1 What is a Set? A set is a well-defined collection of objects called elements or members of the set. Here, well-defined means accurately and unambiguously stated or described. Any

More information

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2)

SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay. Lecture #10 Process Modelling DFD, Function Decomp (Part 2) SOFTWARE ENGINEERING Prof.N.L.Sarda Computer Science & Engineering IIT Bombay Lecture #10 Process Modelling DFD, Function Decomp (Part 2) Let us continue with the data modeling topic. So far we have seen

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

DESIGN AND IMPLEMENTATION OF TOOL FOR CONVERTING A RELATIONAL DATABASE INTO AN XML DOCUMENT: A REVIEW

DESIGN AND IMPLEMENTATION OF TOOL FOR CONVERTING A RELATIONAL DATABASE INTO AN XML DOCUMENT: A REVIEW DESIGN AND IMPLEMENTATION OF TOOL FOR CONVERTING A RELATIONAL DATABASE INTO AN XML DOCUMENT: A REVIEW Sunayana Kohli Masters of Technology, Department of Computer Science, Manav Rachna College of Engineering,

More information

Chapter 13 XML: Extensible Markup Language

Chapter 13 XML: Extensible Markup Language Chapter 13 XML: Extensible Markup Language - Internet applications provide Web interfaces to databases (data sources) - Three-tier architecture Client V Application Programs Webserver V Database Server

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

Structural and Syntactic Pattern Recognition

Structural and Syntactic Pattern Recognition Structural and Syntactic Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent

More information

Recoloring k-degenerate graphs

Recoloring k-degenerate graphs Recoloring k-degenerate graphs Jozef Jirásek jirasekjozef@gmailcom Pavel Klavík pavel@klavikcz May 2, 2008 bstract This article presents several methods of transforming a given correct coloring of a k-degenerate

More information

V Advanced Data Structures

V Advanced Data Structures V Advanced Data Structures B-Trees Fibonacci Heaps 18 B-Trees B-trees are similar to RBTs, but they are better at minimizing disk I/O operations Many database systems use B-trees, or variants of them,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 10: XML Retrieval Hinrich Schütze, Christina Lioma Center for Information and Language Processing, University of Munich 2010-07-12

More information

CSCI 403: Databases 13 - Functional Dependencies and Normalization

CSCI 403: Databases 13 - Functional Dependencies and Normalization CSCI 403: Databases 13 - Functional Dependencies and Normalization Introduction The point of this lecture material is to discuss some objective measures of the goodness of a database schema. The method

More information

Component ranking and Automatic Query Refinement for XML Retrieval

Component ranking and Automatic Query Refinement for XML Retrieval Component ranking and Automatic uery Refinement for XML Retrieval Yosi Mass, Matan Mandelbrod IBM Research Lab Haifa 31905, Israel {yosimass, matan}@il.ibm.com Abstract ueries over XML documents challenge

More information

The SNPR neighbourhood of tree-child networks

The SNPR neighbourhood of tree-child networks Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 22, no. 2, pp. 29 55 (2018) DOI: 10.7155/jgaa.00472 The SNPR neighbourhood of tree-child networks Jonathan Klawitter Department of Computer

More information

Analysis of Algorithms

Analysis of Algorithms Algorithm An algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions. A computer program can be viewed as an elaborate algorithm. In mathematics and

More information

Algebra of Sets. Aditya Ghosh. April 6, 2018 It is recommended that while reading it, sit with a pen and a paper.

Algebra of Sets. Aditya Ghosh. April 6, 2018 It is recommended that while reading it, sit with a pen and a paper. Algebra of Sets Aditya Ghosh April 6, 2018 It is recommended that while reading it, sit with a pen and a paper. 1 The Basics This article is only about the algebra of sets, and does not deal with the foundations

More information

Foundations of Computer Science Spring Mathematical Preliminaries

Foundations of Computer Science Spring Mathematical Preliminaries Foundations of Computer Science Spring 2017 Equivalence Relation, Recursive Definition, and Mathematical Induction Mathematical Preliminaries Mohammad Ashiqur Rahman Department of Computer Science College

More information

Slides for Faculty Oxford University Press All rights reserved.

Slides for Faculty Oxford University Press All rights reserved. Oxford University Press 2013 Slides for Faculty Assistance Preliminaries Author: Vivek Kulkarni vivek_kulkarni@yahoo.com Outline Following topics are covered in the slides: Basic concepts, namely, symbols,

More information

Geometric Crossover for Sets, Multisets and Partitions

Geometric Crossover for Sets, Multisets and Partitions Geometric Crossover for Sets, Multisets and Partitions Alberto Moraglio and Riccardo Poli Department of Computer Science, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK {amoragn, rpoli}@essex.ac.uk

More information

One-Point Geometric Crossover

One-Point Geometric Crossover One-Point Geometric Crossover Alberto Moraglio School of Computing and Center for Reasoning, University of Kent, Canterbury, UK A.Moraglio@kent.ac.uk Abstract. Uniform crossover for binary strings has

More information

Review. CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Let bindings (CBV) Adding Stuff. Booleans and Conditionals

Review. CS152: Programming Languages. Lecture 11 STLC Extensions and Related Topics. Let bindings (CBV) Adding Stuff. Booleans and Conditionals Review CS152: Programming Languages Lecture 11 STLC Extensions and Related Topics e ::= λx. e x ee c v ::= λx. e c (λx. e) v e[v/x] e 1 e 2 e 1 e 2 τ ::= int τ τ Γ ::= Γ,x : τ e 2 e 2 ve 2 ve 2 e[e /x]:

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

Relevance in XML Retrieval: The User Perspective

Relevance in XML Retrieval: The User Perspective Relevance in XML Retrieval: The User Perspective Jovan Pehcevski School of CS & IT RMIT University Melbourne, Australia jovanp@cs.rmit.edu.au ABSTRACT A realistic measure of relevance is necessary for

More information

== is a decent equivalence

== is a decent equivalence Table of standard equiences 30/57 372 TABLES FOR PART I Propositional Logic Lecture 2 (Chapter 7) September 9, 2016 Equiences for connectives Commutativity: Associativity: P Q == Q P, (P Q) R == P (Q R),

More information

Parallel Computation: Many computations at once, we measure parallel time and amount of hardware. (time measure, hardware measure)

Parallel Computation: Many computations at once, we measure parallel time and amount of hardware. (time measure, hardware measure) CMPSCI 601: Recall From Last Time Lecture 24 Parallel Computation: Many computations at once, we measure parallel time and amount of hardware. Models: (time measure, hardware measure) Parallel RAM: number

More information