QUICK: Queries Using Inferred Concepts from Keywords


Technical Report CS

Jeffrey Pound, Ihab F. Ilyas, and Grant Weddell
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, Canada

Abstract

We present QUICK, an entity-based text search engine that blends keyword search with structured query processing over rich knowledge bases (KBs) with massive schemas. We introduce a new formalism for structured queries based on keywords that combines the flexibility of keyword search with the expressiveness of structured queries. We propose a solution to the resulting disambiguation problem caused by introducing keywords as primitives in a structured query language. We show how expressions in our proposed language can be rewritten using the vocabulary of a given KB. The rewritten query describes an arbitrary topic of interest for which corresponding documents are retrieved. We also introduce a method for indexing that allows efficient search over large KB schemas in order to find sets of relevant entities. We then cross-reference the entities with their occurrences in a corpus and rank the resulting document set to produce the top-k relevant documents. An extensive experimental study demonstrates the viability of our approach.

1 Introduction

The World Wide Web hoards massive amounts of unstructured text data. Recent efforts to extract structured data sets from unstructured and semi-structured Web documents have resulted in the creation of massive knowledge bases: structured data sets encoding rich schematic information including millions of entities. As an example, both ExDB [2] and YAGO [21], a large knowledge base derived from Wikipedia and WordNet, have schema items numbering in the millions, and fact collections numbering in the hundreds of millions and tens of millions, respectively. At this scale, writing structured queries can be a daunting task, as the sheer magnitude of the information available to express the query is overwhelming.
We call this the information overload problem.

q(x) :- "german"(x), "has won"(x, y), "nobel award"(y)

Figure 1: An ambiguous conjunctive query and its possible interpretations. The keyword "german" may map to GERMAN PEOPLE, German Language, or German Alphabet; "has won" to haswonprize or hasweight; and "nobel award" to NOBEL PRIZE, AWARD, or Turing Award. The first and third predicates are examples of matching on keyword occurrence, while the second predicate is an example of matching on edit distance.

To deal with this problem, and also to open a new form of exploratory search, recent work has brought keyword query processing to structured data models, such as keyword search over relational databases. While keyword search over structured data models alleviates the information overload problem caused by massive schemas, it comes at a loss of expressivity. Users can no longer express desired structure in the query, and can no longer take advantage of schema information. As an example, consider a user who wants to find all people of German nationality who have won a Nobel award. The user may pose the following conjunctive query to an information system.

q(x) :- GERMAN PEOPLE(x), haswonprize(x, y), NOBEL PRIZE(y). (1)

If the user is not familiar with the schema of the underlying information system (i.e., they do not know the labels of the relations needed to formulate a well-formed query), then they may alternatively issue the following keyword query.

"german has won nobel award" (2)

This query searches for all data items with a syntactic occurrence of the given keywords. There are two important differences between the structured query and the keyword variant. First, the structured query will be processed by making use of explicit semantics encoded as an internal schema. In particular, the query will find data items that are German, rather than those which contain the keyword "german".
Second, the keyword query will look for data items with any syntactic occurrence of "nobel award", while the structured query uses this as an explicit selection predicate over the qualifying entities, exploiting the structure of the query to find qualifying results. This example illustrates how query languages can vary in their expressiveness and ease-of-use. On one end of the spectrum, structured query languages, such as SQL, provide a mechanism to express complex information needs over a schema. A user of such a language must have intimate knowledge of the underlying schema in order to formulate well-formed queries for arbitrary information needs. On the other end of the spectrum are free-form query languages such as keyword queries. These languages allow users to express information needs with little to no knowledge of any schematic constructs in the underlying information system, giving ease-of-use as well as useful exploratory functionality to the user.

Consider the design of a query language that falls somewhere in the middle of this spectrum. We would ideally like to retain the expressive structure of structured queries, while incorporating the flexibility of keyword queries. One way to achieve this is to embed ambiguity into the structured query language by allowing keywords or regular expressions to take the place of entities or relations. In this setting the query from the previous example may be written using the structure in (1), but with keywords as in (2).

q(x) :- "german"(x), "has won"(x, y), "nobel award"(y).

In this model each keyword phrase will be replaced, by the query processor, with a set of candidate schema items based on some metric of syntactic matching, such as keyword occurrence or edit distance, as depicted in Figure 1. The advantage of this approach is the flexibility with which queries can be expressed. Users need not have intimate knowledge of the underlying schema to formulate queries, as is necessary with traditional structured query languages. At the same time, the query retains explicit structure that gives greatly increased expressiveness over flat keyword queries. The problem with this approach is that the number of possible (disambiguated) queries is exponential in the size of the query. This is problematic as many of the possible matchings may be meaningless with respect to the underlying schema, or may not represent the user's actual intention. Processing every possible match is not only inefficient, as many of the possible query interpretations may have empty result sets; it can also overload the user with many unwanted results from unintended interpretations. For example, the query for all GERMAN PEOPLE who have won an AWARD (the super class of all awards) will produce many results that obstruct the user from finding those of interest (those who have won a NOBEL PRIZE).
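To make the matching step concrete, the following sketch generates candidate schema items for each keyword phrase by keyword occurrence with an edit-distance fallback, then enumerates the cross product of candidates. The schema labels and the `max_dist` threshold are illustrative assumptions, not the actual YAGO vocabulary or the system's real matcher.

```python
from itertools import product

# Toy schema vocabulary (illustrative labels, not the real KB).
SCHEMA = ["GERMAN PEOPLE", "German Language", "German Alphabet",
          "haswonprize", "hasweight",
          "NOBEL PRIZE", "AWARD", "Turing Award"]

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def candidates(keyword, max_dist=3):
    """M(k): schema items matched by keyword occurrence or small edit distance."""
    words = keyword.lower().split()
    hits = [s for s in SCHEMA if any(w in s.lower() for w in words)]
    hits += [s for s in SCHEMA if s not in hits and
             edit_distance(keyword.replace(" ", "").lower(), s.lower()) <= max_dist]
    return hits

query = ["german", "has won", "nobel award"]
M = {k: candidates(k) for k in query}
interpretations = list(product(*M.values()))  # exponential in query size
```

Even for this three-phrase query the toy matcher yields 3 × 2 × 3 = 18 candidate interpretations, which is exactly the blow-up the disambiguation step is designed to avoid.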
Note that syntactic matching alone could not possibly be sufficient as a disambiguation solution in all cases, as the keyword "german" is a natural choice to express both a nationality and a language. This phenomenon, known as polysemy, is just one of the challenges of disambiguation. Similar problems arise with synonymy. In this paper, we investigate a solution to the problem of querying over rich massive schemas by introducing a structured query language that builds upon keywords as its most basic operator, while taking disambiguation steps before query evaluation in order to avoid the exponential blow-up caused by keyword ambiguity. Our query language allows a natural keyword-based description of entity-relationship queries over large knowledge bases without intimate background knowledge of the underlying schema. We also show how inference over a rich schema defined by a reference knowledge base, such as those of YAGO or ExDB, can be efficiently incorporated into the entity search system by introducing a method for indexing that allows query-time inference. Our index incorporates the traditional notions of inverted indexing over documents, allowing us to extend our entity search functionality to document retrieval and to incorporate full text search.

1.1 Contributions

In this work, we make the following contributions. We propose a keyword-based structured query language that allows users to create expressive descriptions of their information need without having detailed knowledge of the underlying schema. We analyze the consequences of introducing ambiguity into a structured query language (by means of keywords as base-level constructs), and propose a model for disambiguation. We also show how query terms can be used as keywords to integrate full text search in the cases where full disambiguation cannot be achieved. We introduce a compact, scalable index for inference-enabled entity search that is more space efficient than the current state-of-the-art approaches, with competitive performance in query processing. We also demonstrate the viability of our proposed approach with an experimental evaluation of a prototype implementation of the complete system.

1.2 Outline

The remainder of the paper is organized as follows. Section 2 reviews relevant background information. Section 3 gives an overview of our problem and proposed architecture. Section 4 presents our structured keyword query language, with the disambiguation problem explored in Section 5. Section 6 then presents our index structure and details query processing of disambiguated queries. In Section 7 we show experimental results of a prototype implementation. Section 8 surveys and compares related work with our approach, and Section 9 concludes.

2 Background

Entity Search. Entity search is a variant of traditional information retrieval (IR) in which entities (e.g., Albert Einstein, University of Waterloo, etc.) extracted from text documents form the most basic data-level construct, as opposed to documents in traditional IR.
This view of the search problem opens up a variety of new query processing opportunities; in particular, the ability to process more structured types of queries over the entities and their relations to one another becomes feasible (e.g., [2, 4, 10, 11]). In contrast to traditional IR, query processing in this model may aggregate data from multiple documents to determine which entities are part of a query result. In these models, structured or unstructured query languages can be used to describe entities of interest, and data can reside in a structured model or in text documents. The common facet is that entities form the primitives for search. As an example, consider an entity-based search system over text documents. Given a query for documents containing the entity Max Planck, the system

would retrieve all documents mentioning Max Planck, and not the documents which mention the Max Planck Institute or the Max Planck Medal (an award in physics). Having entities as primitives reduces ambiguity in the semantics of the search, which avoids problems common to traditional keyword search systems. Also, information can be aggregated from different sources to answer a query, such as a search for all documents containing entities bornin(canada). Documents about qualifying entities will be returned independent of the location of the information that encodes where they were born.

Semantic Search. Ontology-based semantic search is a problem similar to entity search in which entities and relations form the base-level data constructs. In these systems, rich semantics defined by ontologies are taken into consideration during query processing (or as a preprocessing phase). However, semantic search systems generally operate over structured semantic data, such as RDF or OWL [7, 17], fact triples [13], or XML [6] (with explicit references to the ontology concepts). A variant of the semantic search problem is to support searching over the semantics of entities found in text documents. Again these models may make use of structured or unstructured query languages. The common thread in semantic search systems is the support for inference over a reference ontology. As an example, consider a semantic search system over text documents. A search for documents containing entities of type GUITARIST would return documents about B. B. King, Jimi Hendrix, and other entities known to be guitarists. These documents are query results independent of whether or not they contain a syntactic occurrence of the word "guitarist". Thus, the search is over the semantics of the entities, rather than the syntax.

Knowledge Representation. The field of knowledge representation is concerned with formalisms for encoding knowledge with clear semantics to support inference over the encoded information.
A collection of schematic information, which forms general rules about a domain of interest, is called an ontology. Concrete instance information, or entities, describes explicit information about particular individuals in the domain, such as Albert Einstein. A knowledge base is a general term for a domain ontology together with instance data. As an example, an ontology might define relationships among abstract primitive concepts, such as the relationship PHYSICIST is-a SCIENTIST. Ontologies may also include arbitrary relations, such as haswonprize, to encode information like: a NOBEL LAUREATE is a PERSON that has a haswonprize relation to some NOBEL PRIZE. The encoding of the knowledge base information and the associated semantics rests on an underlying knowledge representation formalism.
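As a concrete (toy) sketch of such a knowledge base, the snippet below encodes an is-a hierarchy plus asserted entity types, and closes an entity's types under the transitive is-a relation; the names and facts are illustrative, not drawn from a real KB.

```python
# Toy KB: transitive is-a hierarchy plus asserted entity types.
IS_A = {"PHYSICIST": "SCIENTIST", "SCIENTIST": "PERSON",
        "NOBEL LAUREATE": "PERSON"}
TYPES = {"Albert Einstein": {"PHYSICIST", "NOBEL LAUREATE"}}

def all_types(entity):
    """Close an entity's asserted types under the transitive is-a relation."""
    closed = set(TYPES.get(entity, ()))
    frontier = list(closed)
    while frontier:
        parent = IS_A.get(frontier.pop())
        if parent and parent not in closed:
            closed.add(parent)
            frontier.append(parent)
    return closed
```

Here all_types("Albert Einstein") includes SCIENTIST and PERSON even though neither is asserted directly; this is the kind of inference behind the GUITARIST example above.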

3 Problem Definition and Solution Overview

3.1 Preliminaries

We assume access to a domain-related knowledge base and document corpus. These data sources are preprocessed to: (1) extract entities from the text (used for entity search over documents), (2) compute statistics over the knowledge base (used in disambiguation), and (3) index the knowledge base's content and structure (used in query processing), as well as the document text (used to integrate full text search). The schema language for the underlying knowledge bases assumed in our work includes entities, primitive concepts, conjunctions, existential relations, and acyclic concept hierarchies. This is roughly equivalent to the EL dialect of description logics, and is sufficient to capture many real-world knowledge bases and ontologies such as YAGO [21] and SNOMED CT 1, an ontology of medical terms. The basic knowledge base terminology used in our discussion, as well as our notational conventions, are as follows. Entities are written in a typed font with leading capitals. Entities denote constant terms such as people, places, and organizations (e.g., Albert Einstein, Canada). Primitive concepts are written in a typed font in all capitals. Primitive concepts are types or classes to which entities may belong (e.g., SCIENTIST, COUNTRY). Relations are written in italic camel caps. Relations are binary relationships that can exist between entities and/or primitive concepts (e.g., hasWonPrize or bornIn). There is also a transitive is-a relation used to encode hierarchies among concepts. A general concept is any number of entities or primitive concepts joined by logical connectives such as conjunctions and relations (note that relations are implicitly existentially quantified). We use "," to denote a conjunction (e.g., SCIENTIST, haswonprize(Nobel Prize in Physics)). We also adopt the convention of writing keywords in quoted italic lower case (e.g., "scientist").
3.2 Problem Definition

The problem we address is to provide expressive yet flexible access to document collections. We aim to make use of entity-based retrieval and to incorporate inference over a rich schema defined by a reference KB. We address this problem in the following way.

1 SNOMED CT actually uses an extended version of EL called EL++, which includes hierarchies among relations and transitive relations. Our proposed system would need to be extended to handle inference over these constructs.

Figure 2: System architecture.

We develop a keyword-based structured query language that blends the expressiveness of structured queries with the flexibility of keyword queries (Section 4). Next, we address the problem of ambiguity in our query language caused by introducing keywords as primitives. We propose an efficient approach for disambiguating queries in our language to produce a ranked set of possible query interpretations using the vocabulary of an underlying schema (Section 5). We then focus our attention on the problem of efficiently evaluating the disambiguated queries over large, rich schemas. To achieve this, we encode KB structure and content in a compact index structure that allows us to efficiently find all entities that satisfy a disambiguated query in our language. Our index builds on traditional notions of inverted indexing, allowing us to easily incorporate entity-based document retrieval and full text search (Section 6). Lastly, we address the issue of ranking relevant documents by incorporating established mechanisms from the IR community for document ranking based on relevance of keywords to documents.

3.3 System Overview

Our system architecture is illustrated in Figure 2. The solid-lined box on top shows the run-time components, with data flow represented as solid-lined arrows. The grey dotted-lined box shows the preprocessing phase, with preprocessor data flow represented as dotted-lined arrows. The bold-typed components in the diagram show the parts of the system on which we focus in this paper, namely the query language, query disambiguation, indexing, and query processing. We use generic solutions to support the remaining components of the

system (document ranking and information extraction). The preprocessor consists of two components, the information extraction system and the indexing module. These two components make use of two domain-related data sources, a knowledge base and a document corpus. The preprocessing phase proceeds as follows. First, the information extraction tool is run over the document corpus to mine structured information from the text. This information includes extracted named entities as well as relationships among the entities. Facts that were previously unknown to the knowledge base can be contributed (note that we omit this learning phase in our implementation and discussion, as the focus of this paper is not on the extraction component). The named entities are then sent to the indexing mechanism to build an inverted index over entities and documents. The documents' text is also indexed directly to support full text search. In the second preprocessing phase, the structure of the knowledge base is indexed to support efficient entity search within the knowledge base. It is important to note that the entity search module must be capable of handling the expressivity of the constructs used in the underlying knowledge base. The final preprocessing step is to compute statistics over the knowledge base which will be used during query disambiguation. At run time, an arbitrary keyword-based structured query is taken as input to the system. Because the query is based on keywords, there is an inherent ambiguity as to the semantics of the query (i.e., the user's intention). We compute a ranked set of possible query disambiguations based on the vocabulary of the knowledge base. Each disambiguated query is thus a general concept over the underlying knowledge base. One or more of these disambiguated queries can then be evaluated to find relevant entities and their corresponding documents. The documents are then returned to the user ranked by some relevance metric.
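A minimal sketch of the entity-to-document inverted index built in this phase, and of the retrieval semantics it supports (AND across subqueries, OR within each subquery's entity set, as formalized in Section 4). The corpus and entity sets are toy data standing in for the extraction phase's output.

```python
from collections import defaultdict

# Entity mentions per document, as produced by the extraction phase (toy data).
DOCS = {
    "d1": ["Albert Einstein", "Nobel Prize in Physics"],
    "d2": ["Wernher von Braun"],
    "d3": ["Albert Einstein", "Wernher von Braun"],
}

def build_entity_index(docs):
    """Inverted index: entity -> set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in docs.items():
        for e in entities:
            index[e].add(doc_id)
    return dict(index)

def retrieve(subquery_entity_sets, index):
    """OR semantics within each subquery's entity set, AND across subqueries."""
    result = None
    for entities in subquery_entity_sets:
        docs = set().union(*(index.get(e, set()) for e in entities))
        result = docs if result is None else result & docs
    return result if result is not None else set()

index = build_entity_index(DOCS)
# Two subqueries, each already disambiguated to a toy entity set.
hits = retrieve([{"Albert Einstein", "Wernher von Braun"},
                 {"Nobel Prize in Physics", "Albert Einstein"}], index)
```

Any document mentioning at least one qualifying entity per subquery survives the intersection; here that is d1 and d3.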
4 Keyword-based Queries

We now introduce our keyword-based structured query language.

Definition 1 (Document Query) A document query DQ is given by the following.

DQ ::= (Q1), (Q2), ..., (Qn)

where each Qi is a structured keyword query. A document query specifies documents of interest based on entities that qualify for the structured keyword queries, using AND semantics. A structured keyword query describes a set of entities as defined below.

Definition 2 (Structured Keyword Query) Let k be a keyword phrase (one or more keywords); then a structured keyword query Q is defined by the following grammar.

Q ::= k | k(Q) | Q1, Q2, ..., Qn

The first construct allows a primitive concept in a query to be described by a set of one or more keywords (e.g., "nobel prize"). The second construct allows one to describe an entity in terms of the relationships it has to other entities (e.g., "born in"("germany")). The third construct allows a set of queries to describe a single class of entities in conjunction (e.g., "harmonica player", "songwriter"). The comma operator is used to explicitly label the conjunctions, reducing the ambiguity among multi-keyword phrases that occur in conjunction. Note that the relation query constructor allows an arbitrary query to describe the entity to which the relation extends; for example, "born in"("country", "has official language"("spanish")) could be used to query all entities born in Spanish-speaking countries. More formally, the resulting entity set {e} for a structured keyword query Q denoting general concept C is given by the following: {e | e ∈ C}. We use the notation e ∈ Q to say that entity e is in the result set of query Q. Section 5 discusses how to find the concept C which is denoted by Q. To illustrate how structured keyword queries are used in document queries, consider the following example document query.

("person", "born in"("germany")), ("participated"("world war ii"))

The query indicates that qualifying documents should mention both a person born in Germany and some entity that participated in World War II. More formally, the resulting document set {D} for a document query DQ = (Q1), (Q2), ..., (Qn) is given by the following:

{D | ∀Q ∈ DQ, ∃e ∈ Q such that e ∈ D}

where e ∈ D denotes entity e occurring in the text of document D. Intuitively, the document retrieval uses OR semantics per entity set denoted by a subquery, and AND semantics across subqueries in the DQ.

5 Disambiguating Query Intent

Because our query language is built on keywords as primitives, there is an inherent ambiguity in the semantics of the query.
The user's intentions are unknown, since each keyword in the query could map to one of many items from the knowledge base, and there is an exponential number of ways in which we could select an item for each keyword to create a general concept.

Figure 3: An example query disambiguation graph.

5.1 The Disambiguation Model

Consider a structured keyword query Q. Each keyword in Q could have many possible schema item matches based on some measure of syntactic similarity. Let M(k) denote the set of possible schema item (entity/concept/relation) mappings of keyword k. The goal is to choose one schema item for each keyword in the query, such that the resulting disambiguation is a meaningful representation of the query with respect to the knowledge base, and is also representative of the user's intention. There is a tradeoff that must be considered when choosing concept interpretations for each keyword: we want mappings that represent the user's intentions as closely as possible in terms of syntactic similarity, but which also have a meaningful interpretation in terms of the underlying knowledge base by means of semantic relatedness. We encode our disambiguation problem as a graph which represents the space of all possible interpretations that bind one entity, concept, or relation to each keyword. In the discussion that follows, we refer to a concept or the subgraph that denotes that concept interchangeably.

Definition 3 (Disambiguation Graph) Let Q be a structured keyword query. Then a disambiguation graph G = (V, E) with vertex set V, edge set E, and P a set of disjoint partitions of the nodes in V, are defined by the following.

V = ∪_{k ∈ Q} M(k)
P = {M(k) | k ∈ Q}
E = edges(Q)

where the edge-generating function edges(Q) is defined as follows.

edges(k) = {}
edges(k(Q)) = {⟨n1, n2⟩ | n1 ∈ M(k), n2 ∈ root(Q)} ∪ edges(Q)
edges(Q1, Q2) = {⟨n1, n2⟩ | n1 ∈ root(Q1), n2 ∈ root(Q2)} ∪ edges(Q1) ∪ edges(Q2)

where root(Q) denotes the vertices in the root-level query of Q, as follows.

root(k) = M(k)
root(k(Q)) = M(k)
root(Q1, Q2) = root(Q1) ∪ root(Q2)

(Note that we have abstracted n-ary conjunctions as binary conjunctions for ease of presentation; arbitrary conjunctions can be formed by considering all pairwise conjunctions in the n-ary expression.)

Figure 3 depicts an example of a disambiguation graph for the query "german", "scientists", "have won"("nobel award"). The boxes denote vertices (corresponding to concepts or entities in the underlying knowledge base), the dotted circles denote partitions (the sets M(ki)), the italic-font labels denote the keywords for which the partitions were generated (ki), and the interconnecting lines denote edges which represent semantic similarity. Observe that each level of any subquery is fully connected t-partite, while edges do not span query nesting boundaries. This is because nested concepts do not occur in conjunction with concepts at other nesting levels; they are related only through the vertices that denote the explicit knowledge base relations. For a disambiguation graph G generated from a structured keyword query Q, any induced subgraph of G that spans all partitions in P corresponds to a concept interpretation of Q. For example, the induced subgraph spanning the nodes {GERMAN PEOPLE, SCIENTIST, haswonprize, NOBEL PRIZE} corresponds to the concept GERMAN PEOPLE, SCIENTIST, haswonprize(NOBEL PRIZE). Figure 4 illustrates a disambiguation graph for a more complex query with multiple nesting branches and multiple levels of nesting. In this example, a meaningful interpretation could be the following concept.

GERMAN PEOPLE, SCIENTIST, haswonprize(NOBEL PRIZE),

livesin(COUNTRY, hasofficiallanguage(English Language))

Figure 4: An example of a query disambiguation graph. Each edge represents complete bipartite connectivity between all vertices in the partitions.

Note the difficulty in disambiguating this query. In one part of the query the keyword "german" is used to denote a nationality, while in another part of the query the keyword "english" is used to denote a language. Both of these keywords are legitimate choices to denote both nationalities and languages, and consequently purely syntactic matching alone could not possibly handle both situations correctly. This phenomenon is known to linguists as polysemy. Similar issues arise with synonymy (for example, the phrases "guitarist" and "guitar player" both denote the same concept). It is evident that the space of possible query interpretations is exponential in the size of the query. Finding the best or top-k interpretations will depend on how we compute a score for a candidate subgraph corresponding to a query interpretation.

5.2 The Scoring Model

Now that we have established a model for generating a space of candidate query interpretations, we need to define the notion of a score in order to compute the top-k interpretations. We start by reviewing notions of semantic and syntactic similarity which will form the basis of our scoring function.

Semantic Similarity. The first factor of our scoring model is semantic similarity, which we denote generically using semanticSim(A, B). The problem of computing the similarity between two concepts has a long history in the computational linguistics and artificial intelligence fields. Early works focused on similarities over taxonomies,

often including the graph distance of concepts [12, 14, 23], and probabilistic notions of the information content of classes in the taxonomy [12, 15, 19]. More recent works have looked at generalizing semantic similarity metrics from taxonomies to general ontologies (i.e., from trees to graphs) [5, 16, 20]. As an example, Lin defines the semantic similarity between two concepts as the following [15].

Sim_Lin(A, B) = 2 · log P(LCS(A, B)) / (log P(A) + log P(B))

where LCS(A, B) denotes the least common subsumer of A and B, i.e., the first common ancestor in an is-a graph, and P(A) denotes the probability of a randomly drawn entity belonging to concept A. Lastly, there are the traditional notions of set similarity which can be applied, since concepts in a taxonomy or ontology denote sets of entities. Two commonly used metrics are the Dice coefficient and the Jaccard index, the latter illustrated below.

Sim_Jaccard(A, B) = |A ∩ B| / |A ∪ B|

One common trait of all of these similarity measures is that they are binary; they express similarity over pairs of concepts. We will see that efficiently accommodating n-ary similarities using preprocessed statistics poses challenges, though our model allows a natural approximation based on aggregates of binary metrics. We now introduce the notion of knowledge base support, which characterizes the semantic similarity of all primitive concepts and relations occurring in a complex general concept. In the definition of knowledge base support, we appeal to n-ary similarity, with our approximation technique to be introduced in Section 5.3.

Definition 4 (Knowledge Base Support) Let C be a general concept expressed using the primitive concepts, relations, and entities in the knowledge base KB. Then the support of C by KB is given by the following.
support(C, KB) = semanticSim(C_1^0, ..., C_{n_0}^0, R_1^0, ..., R_{m_0}^0)
               + semanticSim(R_1^0, ..., R_{m_0}^0, C_1^1, ..., C_{n_1}^1, R_1^1, ..., R_{m_1}^1)
               + ...
               + semanticSim(R_1^{p-1}, ..., R_{m_{p-1}}^{p-1}, C_1^p, ..., C_{n_p}^p, R_1^p, ..., R_{m_p}^p)

where each C_j^i (R_j^i) is the j-th primitive concept (relation) in a conjunction occurring in C at nesting branch i. Intuitively, this is the similarity according to the structure of the disambiguation graph (which corresponds to the structure of the query). We compute the similarity of primitive concepts occurring in conjunction, along with the incoming relations from the previous nesting level and the outgoing relations to the next nesting level.
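The level-wise decomposition of Definition 4 can be sketched as follows. The nested-tuple concept encoding is an illustrative choice, and semanticSim itself is left abstract; the code only enumerates the argument groups of the semanticSim terms.

```python
# A general concept: (primitive_concepts, [(relation, nested_concept), ...]).
concept = (
    ["GERMAN PEOPLE", "SCIENTIST"],
    [("haswonprize", (["NOBEL PRIZE"], [])),
     ("livesin", (["COUNTRY"],
                  [("hasofficiallanguage", (["English Language"], []))]))],
)

def support_groups(concept, incoming=None):
    """Argument groups for the semanticSim terms of Definition 4: each group
    holds the incoming relation, the conjoined concepts, and outgoing relations."""
    concepts, relations = concept
    group = ([incoming] if incoming else []) + concepts + [r for r, _ in relations]
    groups = [group]
    for rel, nested in relations:
        groups += support_groups(nested, incoming=rel)
    return groups

groups = support_groups(concept)
# One semanticSim call per group; support(C, KB) sums over all groups.
```

For this concept the function yields four groups, one per conjunction in the nesting structure, each to be passed to a (possibly approximated) semanticSim.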

For our example in Figure 4, with the previously suggested disambiguation, the support would correspond to the following.

semanticSim(GERMAN PEOPLE, SCIENTIST, haswonprize, livesin)
+ semanticSim(haswonprize, NOBEL PRIZE)
+ semanticSim(livesin, COUNTRY, hasofficiallanguage)
+ semanticSim(hasofficiallanguage, English Language)

Syntactic Similarity. The second component of our scoring model is the syntactic matching of the concept label (or one of its synonyms) to the keyword. Indeed, the user entering queries has an idea of the entities they are trying to describe. It is simply the terminology of the user and the knowledge base that may differ, and the massive scale of the knowledge base impedes learning on the user's part. The closer we can match a query syntactically while maximizing semantic similarity, the better our query interpretation will be with respect to both the user's intentions and the knowledge encoded in our knowledge base. We use the function syntaxSim(a, b) to denote a measure of the syntactic similarity between a schema item label a and a keyword phrase b. This could be simple keyword occurrence in the knowledge base item's label, or a more relaxed measure such as edit distance.

Score Aggregation. Our final scoring model is thus some combination of knowledge base support, which quantifies semantic support for the candidate disambiguation, and syntactic similarity, which reflects how closely the candidate concept interpretation matches the user's initial description of their query intention. We can tune how important each factor is by how we combine and weight the two similarity metrics. In general, we allow any aggregation function to be plugged into our system. However, as we will see in later sections, we can improve efficiency if the aggregation function is monotonic by making use of threshold algorithms for top-k search. We denote such an aggregation function using ⊕.
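One natural choice for ⊕ is a monotonic weighted sum, sketched below; the mixing parameter alpha is an illustrative assumption, not a value prescribed by the system.

```python
def make_weighted_sum(alpha):
    """A monotonic aggregation ⊕: alpha * support + (1 - alpha) * syntaxSim."""
    def aggregate(support, syntax_sim):
        return alpha * support + (1 - alpha) * syntax_sim
    return aggregate

score = make_weighted_sum(0.7)  # weight semantic support more heavily
# Monotone: improving either input never decreases the aggregate score,
# which is the property threshold-style top-k algorithms rely on.
```

Because the aggregate can only grow when either component grows, partial candidates whose best-case completion already falls below the current top-k threshold can be pruned early.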
Definition 5 (Concept Score) Let Q be a structured keyword query, C a concept, ⊕ a binary aggregation function, and KB a knowledge base. Then the score of concept C, with respect to Q, KB, and ⊕, is given by the following.

score(C, Q, KB) = support(C, KB) ⊕ syntaxSim(C, Q)

Given a disambiguation graph G with partition set P and an integer parameter k, the goal of disambiguation is to find the top-k maximum-scoring subgraphs of G that span all partitions in P. Intuitively, we want to find the best k interpretations of the query in terms of some balance between knowledge base support and syntactic similarity. The difficulty in solving the disambiguation problem lies in the nature of the scoring function. Looking back to our disambiguation graph model, the score of a general concept representing a candidate query interpretation depends

on all components of the concept (all conjunctions of primitive concepts and relations). From the disambiguation graph point of view, this means that the score (or weight) of each edge changes depending on which other edges are part of the subgraph forming the candidate concept interpretation. The amount of statistics that would have to be (pre)computed in order to support queries with n terms would be very large: one would need access to the semantic similarity of all n-way compositions of entities, primitive concepts, and relations occurring in the knowledge base.

5.3 Approximating the Scoring Model

In most cases, having access to n-way semantic similarity would require precomputing an infeasible number of statistics, or incurring the expensive cost of computing similarity at query time. Because there is an exponential number of candidate query interpretations, computing similarity at query time could add unacceptable costs to query performance. For example, the previously described Sim_Lin metric requires a least common subsumer computation over a large graph, while the Jaccard and Dice metrics require intersections of (possibly large) sets. Thus, we move to an approximation model for any binary metric of semantic similarity. We approximate the score of a candidate query interpretation by aggregating the pairwise support of all components of the candidate concept. (Note that pairwise support reduces Definition 4 to a single binary similarity computation.) In terms of the disambiguation graph, this means that each edge weight in the graph can represent the support of having the two primitive concepts or relations denoted by the edge's vertices in conjunction. We extend a disambiguation graph G to include a weight function w that assigns weights to the vertices v and edges ⟨v1, v2⟩ of G as follows. The weights are computed with respect to some knowledge base KB and the keywords k from a structured keyword query Q.
w(v) = syntaxSim(label(v), k), where v ∈ M(k)
w(⟨v1, v2⟩) = support((item(v1), item(v2)), KB)

where item(v) denotes the knowledge base schema item (primitive concept, relation, or entity) represented by vertex v, and label(v) denotes the string representation of the schema item represented by v. (In practice, it would be beneficial to consider the syntactic similarity of the keyword to the closest synonym of the concept label.) The approximate score of an induced subgraph denoting a candidate concept interpretation is given by the following.

Definition 6 (Approximate Score) Let Q be a structured keyword query, KB a knowledge base, and G a subgraph of the disambiguation graph of Q representing a concept C. Then the approximate score of C represented by G

with respect to Q and KB is given by the following.

ŝcore(G, Q, KB) = (⊕_{⟨v1, v2⟩ ∈ E} w(⟨v1, v2⟩)) ⊕ (⊕_{v ∈ V} w(v))

5.4 Solving the Disambiguation Problem

The goal of the disambiguation problem is to find the top-k maximally scoring subgraphs (corresponding to concept interpretations of the original query) that are single connected components. Note that simply taking the maximum scoring subgraph for each subquery will not necessarily result in a connected graph, and thus does not map to a concept interpretation of the query. If our scoring function is monotonic, we can implement a threshold algorithm (TA) to efficiently find the top scoring subgraphs, so long as we have access to vertices and edges in sorted order. A global ordering over all edges can be precomputed and indexed, and vertices denoting concepts can be sorted on the fly, since these sets fit in memory and can be kept relatively small in practice. Alternatively, we can implement a TA if we have access to only the vertices in sorted order and allow random access to edges. The disambiguation problem can be viewed as a relational rank-join, a TA that implements a special case of A* search over join graphs. In this view of the problem, our disambiguation graph is a join graph, with each keyword in the query providing an input to a rank-join operation based on the partition corresponding to the keyword. Each partition is ordered by syntactic similarity. Consider each possible concept or relation match (partition M(k)), ordered by syntactic similarity, as an input. We join two concepts if they have knowledge base support, and rank the join based on the value of the approximate score. Simultaneously, we join across nested subqueries based on equivalence of the vertex representing the relation keyword. This ensures the resulting joins form connected graphs in terms of the underlying disambiguation graph.
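The approximate score used to rank candidate joins can be sketched as follows, assuming ⊕ = addition; the subgraph and the weights are invented for illustration and are not taken from the paper.

```python
def approx_score(vertices, edges, w_vertex, w_edge):
    """Definition 6 with additive ⊕: aggregate pairwise edge support
    with the per-vertex syntactic similarity of the subgraph."""
    return (sum(w_edge[e] for e in edges) +
            sum(w_vertex[v] for v in vertices))

# Candidate interpretation {SCIENTIST, hasWonPrize} as a one-edge subgraph:
w_vertex = {"SCIENTIST": 0.75, "hasWonPrize": 0.25}   # syntaxSim weights
w_edge = {("SCIENTIST", "hasWonPrize"): 0.5}          # pairwise support
print(approx_score(w_vertex, w_edge, w_vertex, w_edge))  # 1.5
```

Since each edge weight depends only on its two endpoints, all needed statistics are pairwise and can be precomputed, unlike the exact n-way score.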
As an example, consider the disambiguation graph from Figure 3; we illustrate a rank-join approach to disambiguation of the query in Figure 5. The illustration shows each partition as an input to the rank-join. Conjunctive components are joined based on the existence of edges in the disambiguation graph, while entire nested queries are joined based on equality of the binding of the relation keyword to a schema relation. For brevity, we omit a review of the threshold algorithm for rank-join. The challenge here is in the modeling of the problem as a rank-join as we have presented; with this model in mind, the actual execution does not differ from that of a traditional rank-join. We refer the reader to [9] for an overview of the algorithm. While solving the disambiguation problem may involve exploring an exponential search space in the worst case, our experiments show that we can still find meaningful solutions quickly in practice (see Section 7).
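A minimal two-partition sketch of this rank-join with early termination is shown below. It assumes sorted access to each partition by syntactic similarity, random access to pairwise edge support, and addition as the monotonic ⊕; partition contents and support values are invented for illustration.

```python
import heapq

def rank_join_top_k(part_a, part_b, edge_support, k):
    """part_a/part_b: lists of (vertex, syntax_sim), sorted descending.
    edge_support: dict mapping vertex pairs to pairwise KB support.
    Returns the top-k joined pairs by approximate score."""
    results = []  # min-heap of (score, pair)
    max_support = max(edge_support.values(), default=0.0)
    for i, (va, sa) in enumerate(part_a):
        for vb, sb in part_b:
            # join condition: the pair has knowledge base support
            if (va, vb) in edge_support:
                score = sa + sb + edge_support[(va, vb)]
                heapq.heappush(results, (score, (va, vb)))
                if len(results) > k:
                    heapq.heappop(results)
        if i + 1 < len(part_a):
            # threshold: best possible score using any unseen part_a vertex
            threshold = part_a[i + 1][1] + part_b[0][1] + max_support
            if len(results) == k and results[0][0] >= threshold:
                break  # no unseen candidate can enter the top-k
    return sorted(results, reverse=True)

top = rank_join_top_k(
    [("SCIENTIST", 0.9), ("PERSON", 0.5)],
    [("hasWonPrize", 0.8), ("hasAcademicAdvisor", 0.3)],
    {("SCIENTIST", "hasWonPrize"): 0.5, ("PERSON", "hasWonPrize"): 0.25},
    k=1)
print(top[0][1])  # ('SCIENTIST', 'hasWonPrize')
```

The monotonicity of ⊕ is what makes the threshold sound: once the current k-th score exceeds the best score any unseen input could contribute, the scan can stop.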

Figure 5: Encoding disambiguation as a rank-join with random access to edges. The join conditions are on the existence of edges in the disambiguation graph for each subquery, and on the equality of relation bindings to join across all subqueries.

5.5 Integrating Full Text Search

As a final consideration, it may be the case that no subgraph can be found that spans all partitions in a disambiguation graph for some query. This happens when we cannot find a meaningful interpretation for one or more keywords. In this case we proceed as follows:

- We relax the problem definition to not require spanning all partitions when finding a subgraph of the disambiguation graph. This allows us to build partial interpretations.
- We augment our scoring model to add a constant penalty factor for each unmapped keyword. This allows us to tune preference for complete disambiguations.
- Lastly, we include all keywords that are unmapped as part of the document retrieval process. In this setting, we will retrieve documents that contain one or more entities from the concept interpretation of the query (as before), and favor those which contain the additional keywords in the ranking.

For example, a partial interpretation for the query german, scientists, have won (nobel award) could be the concept SCIENTIST, hasWonPrize(NOBEL PRIZE), which would then be parameterized with the keyword german during document retrieval. Note that in the worst case (in terms of disambiguation), our system degrades to a regular keyword search.

6 Indexing and Query Evaluation

One way to support inference over an ontology or type hierarchy in a semantic search system is by expanding the knowledge into the underlying data [1, 2, 3]. In the document retrieval model, this amounts to computing the deductive closure over each entity discovered in the text, and expanding the full closure into the document for indexing. For example, if one finds the entity Albert Einstein in a document, one would add identifiers for PHYSICIST, SCIENTIST, PERSON, MAMMAL, LIVING ORGANISM, etc., to the document. This way a query for PHYSICIST will retrieve documents that contain the entity Albert Einstein. The benefit of this approach is its simplicity: one need not make any special modifications to the indexing procedure or the search algorithm in order to support inference over the hierarchy. A drawback of this method is that it greatly reduces the capabilities for inference during query processing. For example, even if we expand the full closure of each entity's relations into the document, we still could not answer a complex query such as born in(located in(has language(german))) to find documents about Albert Einstein (an entity who was born in a city that is located in a country with German as the spoken language). The other major drawback to this approach is the redundancy in the resulting inverted index. Sets of entities will be repeated over and over again for each concept of related type. For example, the index entry for SCIENTIST will repeat the entire index entry for PHYSICIST, since every document mentioning a physicist also mentions a scientist by transitivity. This can potentially cause an infeasible amount of redundancy, even for relatively simple ontologies.
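The closure-expansion approach and its redundancy can be sketched with an invented miniature taxonomy: every entity found in a document is indexed together with its full chain of superclasses, so broader concepts repeat the postings of narrower ones.

```python
SUPERCLASS = {                      # child -> parent (is-a), illustrative
    "Albert Einstein": "PHYSICIST",
    "PHYSICIST": "SCIENTIST",
    "SCIENTIST": "PERSON",
}

def closure(item):
    """All identifiers implied by an item via the is-a hierarchy."""
    out = [item]
    while item in SUPERCLASS:
        item = SUPERCLASS[item]
        out.append(item)
    return out

def index_documents(docs):
    """docs: {doc_id: [entities found in the text]} -> inverted index."""
    inverted = {}
    for doc_id, entities in docs.items():
        for ent in entities:
            for ident in closure(ent):
                inverted.setdefault(ident, set()).add(doc_id)
    return inverted

idx = index_documents({"D1": ["Albert Einstein"]})
# A query for PHYSICIST now retrieves D1, but SCIENTIST and PERSON
# each redundantly repeat the same posting.
print(idx["PHYSICIST"], idx["PERSON"])  # {'D1'} {'D1'}
```

With a deep hierarchy and large document collections, each posting is duplicated once per ancestor concept, which is exactly the redundancy the HII structure below is designed to avoid.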
Current approaches to this problem restrict the amount of the closure that is expanded into the document, and require that any query can then be mapped to one of a small number of pre-defined top-level classes.

6.1 Hierarchical Inverted Index

We propose a family of combined document and ontology index structures, referred to as a Hierarchical Inverted Index (HII), that builds on traditional notions of graph representations and inverted indexing. The motivation behind our index structure is to avoid redundancy in the index in order to be space efficient, and to support inference over the ontology at run time through query expansion. We detail the design of the index below, and explore the impact on performance in Section 7. To support querying over simple hierarchies (taxonomies), we describe the simplest instantiation of our index, HII_α, depicted in Figure 6. For each primitive concept and entity, the general idea is to index the inverse direction of the is-a relations in order to support finding more specific concepts and entities

of a specified entity type. This is a subset of the adjacency-list based graph encoding. For each index entry, the document collection directly associated with that concept or entity (but not its entire closure) is listed. Note that we may also populate strictly keyword-based entries to support full text search over the same inverted index.

C_1 : {D_a, D_b, ...}, {is-a → {C_11, C_12, ...}}
C_2 : {D_i, D_j, ...}, {is-a → {C_21, C_22, ...}}
  ...
C_n : {D_l, D_p, ...}, {is-a → {C_n1, C_n2, ...}}

Figure 6: HII_α: an inverted index mapping entities and concepts to documents, with links in the inverse direction of the hierarchical is-a relationship.

The problem with the simple HII_α is that querying by relations cannot be supported efficiently (for example, born in(canada)). We extend HII_α to HII_β in Figure 7 to support these types of queries by also indexing the inverse direction of all relations R.

C_1 : {D_a, D_b, ...}, {R_1 → {C_11, C_12, ...}, ..., R_m → {C_11, C_12, ...}}
C_2 : {D_i, D_j, ...}, {R_1 → {C_21, C_22, ...}, ..., R_m → {C_21, C_22, ...}}
  ...
C_n : {D_l, D_p, ...}, {R_1 → {C_n1, C_n2, ...}, ..., R_m → {C_n1, C_n2, ...}}

Figure 7: HII_β: an inverted index mapping entities and concepts to documents, with links in the inverse direction of all relations R (including the hierarchical is-a relationship).

This also gives us the ability to index ontologies with more complex concept definitions, which include both conjunctions and relations. For example, NOBEL LAUREATE is-a (PERSON, hasWonPrize(NOBEL PRIZE)). This index does not introduce any extra redundancy; rather, each entry references the related entries via the appropriate relations. HII_β also encodes sufficient information to efficiently support the disambiguated structured keyword queries described in Section 4. Figure 8 shows an example index for a portion of the KB in our running example. Lastly, for completeness, we note that our index family can be extended to include a full ontology index by also indexing inverse relations.
In our index structure, this amounts to indexing the forward direction of relations (the inverse of the inverse) to support querying over inverse relations (e.g., bornIn⁻(Albert Einstein)).
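The HII_β entry layout can be sketched with a simple in-memory representation; the paper's implementation uses a Berkeley DB back-end, so the dict-based layout and names below are purely illustrative (mirroring Figure 8).

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    docs: set = field(default_factory=set)        # documents directly listed
    relations: dict = field(default_factory=dict) # relation -> inverse links

hii = {
    "SCIENTIST": Entry(set(), {"is-a": {"PHYSICIST", "CHEMIST"}}),
    "PHYSICIST": Entry(set(), {"is-a": {"Albert Einstein", "Max Planck"}}),
    "Nobel Prize In Physics": Entry(
        {"D4"}, {"hasWonPrize": {"Albert Einstein", "Max Planck"}}),
    "Albert Einstein": Entry({"D1", "D4", "D12"}, {}),
    "Max Planck": Entry({"D4", "D7"}, {}),
}

# Each entry lists only its directly associated documents and links in
# the inverse direction of relations; no closure is materialized, so
# SCIENTIST does not repeat the postings of PHYSICIST.
print(sorted(hii["PHYSICIST"].relations["is-a"]))  # ['Albert Einstein', 'Max Planck']
```

Query expansion then recovers the closure at run time by following the is-a links, trading a small amount of traversal work for the index-size savings.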

NOBEL PRIZE : {}, {is-a → {Nobel Prize In Physics, Nobel Prize In Chemistry, Nobel Peace Prize, ...}}
SCIENTIST : {}, {is-a → {PHYSICIST, CHEMIST, COMPUTER SCIENTIST, ..., Isaac Newton, ...}, ...}
PHYSICIST : {}, {is-a → {Albert Einstein, Max Planck, Galileo Galilei, ...}, ...}
CHEMIST : {}, {is-a → {Marie Curie, ...}, ...}
Nobel Prize In Physics : {D_4}, {hasWonPrize → {Albert Einstein, Max Planck, Marie Curie, ...}, ...}
Nobel Prize In Chemistry : {D_9}, {hasWonPrize → {Marie Curie, Otto Wallach, ...}, ...}
Albert Einstein : {D_1, D_4, D_12}, {isMarriedTo → {Mileva Maric}, ...}
Max Planck : {D_4, D_7}, {hasAcademicAdvisor → {Gustav Ludwig Hertz, Walther Bothe, Max von Laue}, ...}

Figure 8: Part of an example HII_β index.

6.2 Query Evaluation

The procedure illustrated in Figure 9 details the general search procedure for query processing over the hierarchical inverted index. We use index-lookup(C) to denote retrieving an index entry for entity or primitive concept C, and use get(relation) to denote retrieving the entities and concepts associated with relation from the index entry. For notational convenience, we also use the function getEnts() to denote retrieving the entities associated with the hierarchical is-a relation, and getSubClasses() to denote retrieving the primitive concepts associated with the hierarchical is-a relation. Consider a search for the entities described by the following concept over the index shown in Figure 8.

PHYSICIST, hasWonPrize(NOBEL PRIZE)

The first call to our eval procedure would compute the intersection of the recursive calls to eval over PHYSICIST and hasWonPrize(NOBEL PRIZE). The call to eval(PHYSICIST) would run by collecting the entities in the index entry, and would then recursively walk down the hierarchy by following any references (in the example index there are none).
The call to eval(hasWonPrize(NOBEL PRIZE)) would look up the index entry for NOBEL PRIZE and collect any entities in the corresponding hasWonPrize set (in the example, this set is again empty). For primitive concepts occurring in this set, we would evaluate them and collect all entities by descending the hierarchy in the same way as was done for PHYSICIST. We then descend the hierarchy for NOBEL PRIZE by following the is-a references, collecting all entities in the hasWonPrize sets in each recursive call. In the example index, we will find all entities corresponding to the hasWonPrize relation for the Nobel Prize In Physics and Nobel Prize In Chemistry

entries. This set includes {Albert Einstein, Max Planck, Marie Curie, Otto Wallach}, which we then intersect with our set of PHYSICISTs, {Albert Einstein, Max Planck, Galileo Galilei}, yielding the query result {Albert Einstein, Max Planck} and their corresponding documents {D_1, D_4, D_7, D_12}.

eval(Q_1, Q_2):
    return eval(Q_1) ∩ eval(Q_2)

eval(C):
    entry = index-lookup(C)
    res = entry.getEnts()
    sc = entry.getSubClasses()
    for each (C' in sc)
        res = res ∪ eval(C')
    end for
    return res

eval(R(C)):
    entry = index-lookup(C)
    related = entry.get(R)
    for each (C' in related)
        res = res ∪ eval(C')
    end for
    sc = entry.get(is-a)
    for each (C' in sc)
        res = res ∪ eval(R(C'))
    end for
    return res

Figure 9: Hierarchical index traversal procedure.

6.3 Query Results and Ranking

There are a number of ways one could envision presenting and ranking results from our system. One approach could have entities as results, each with a list of associated documents. The entities could be ranked by their relationships to the query in the ontology, or based on statistics over the document corpus. Alternatively, one may want a document list as the result, ranked by the relevance of the corresponding entity to the document using a form of tf-idf measure. A third option could combine these two approaches, ranking groups of documents per entity, and ranking the groups based on an overall ranking of entities. This would be analogous to a GROUP BY statement in relational query processing, in this case grouping by entity. Various works have looked at entity-specific ranking techniques [4, 8, 18]. While ranking is not the focus of this paper, it is a necessary component of a deployable system. We take the simple approach of ranking the resulting document set as a whole, based on the tf-idf score of the corresponding entities that qualified the document as a result.
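The index traversal of Section 6.2 can be sketched as runnable code over a fragment of the Figure 8 example, assuming Python dicts in place of the paper's Berkeley DB-backed index; each entry maps an item to its document set and its inverse relation links.

```python
INDEX = {
    "PHYSICIST": (set(), {"is-a": ["Albert Einstein", "Max Planck",
                                   "Galileo Galilei"]}),
    "NOBEL PRIZE": (set(), {"is-a": ["Nobel Prize In Physics",
                                     "Nobel Prize In Chemistry"]}),
    "Nobel Prize In Physics": ({"D4"}, {"hasWonPrize":
        ["Albert Einstein", "Max Planck", "Marie Curie"]}),
    "Nobel Prize In Chemistry": ({"D9"}, {"hasWonPrize":
        ["Marie Curie", "Otto Wallach"]}),
}

def eval_concept(c):
    """Collect entities for C by descending inverse is-a links."""
    if c not in INDEX:              # an entity evaluates to itself
        return {c}
    _, links = INDEX[c]
    res = set()
    for child in links.get("is-a", []):
        res |= eval_concept(child)
    return res

def eval_relation(r, c):
    """Collect entities related by R to C or to anything below C."""
    res = set()
    _, links = INDEX.get(c, (set(), {}))
    for item in links.get(r, []):
        res |= eval_concept(item)
    for child in links.get("is-a", []):
        res |= eval_relation(r, child)
    return res

# PHYSICIST, hasWonPrize(NOBEL PRIZE)
result = eval_concept("PHYSICIST") & eval_relation("hasWonPrize", "NOBEL PRIZE")
print(sorted(result))  # ['Albert Einstein', 'Max Planck']
```

The intersection of the two traversals reproduces the worked example above without ever materializing the closure of either concept in the index.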
7 Experimental Evaluation

We have implemented the entire QUICK system as depicted in Figure 2, with the exception of the feedback mechanism for learning new facts from text.

                          QUICK     PostgreSQL
KB Representation Size    1.2 GB    41 GB
Average CPU Usage         100%      29.2%

Figure 10: Statistics for experimental runs.

The focus of our experimental evaluation is on performance. The quality of query disambiguations and of document results is an important issue, but its relevance is diminished if the system does not perform at a reasonable standard. While additional user studies are required to evaluate the quality of our system, we are only concerned with the performance aspects in this paper and leave quality evaluation as future work. Informally, however, we find in practice that our system does return relevant documents and generally finds intended query interpretations.

7.1 System Descriptions

QUICK is built on a variety of open-source software. For the information extraction component, we built a simple text processor using the OpenNLP library. This tool parses and annotates person, location, and organization named entities. For simplicity in document retrieval, we build the document side of our hierarchical index separately from the KB information. We built a simple text and named entity indexer using the Lucene library. The hierarchical inverted index was written in Java and uses Berkeley DB as a back-end data store. Our query disambiguation system was also written in Java and uses a separate Berkeley DB store to maintain the knowledge base semantic similarity measures. All semantic similarity values used for disambiguation are pre-computed. We use the Jaccard index discussed in Section 5 as our semantic similarity measure, the Levenshtein edit distance for syntactic similarity, and addition as our aggregation function. We compare our system to a more traditional approach for semantic search: a relational back-end with a pre-computed closure of the knowledge base. In this system, all inference that could possibly be done over the knowledge base is pre-computed and stored.
We use the tools distributed with YAGO to create the database image (both the closure and indexing). We also use the query processor distributed with YAGO to generate SQL equivalents of our workload. We use PostgreSQL as our relational database.

7.2 Data and Workload

We use the 2008 version of YAGO [21] as our knowledge base (note that this is approximately three times larger than the initial instance reported in [21])


More information

Semantics Modeling and Representation. Wendy Hui Wang CS Department Stevens Institute of Technology

Semantics Modeling and Representation. Wendy Hui Wang CS Department Stevens Institute of Technology Semantics Modeling and Representation Wendy Hui Wang CS Department Stevens Institute of Technology hwang@cs.stevens.edu 1 Consider the following data: 011500 18.66 0 0 62 46.271020111 25.220010 011500

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

Debugging the missing is-a structure within taxonomies networked by partial reference alignments

Debugging the missing is-a structure within taxonomies networked by partial reference alignments Debugging the missing is-a structure within taxonomies networked by partial reference alignments Patrick Lambrix and Qiang Liu Linköping University Post Print N.B.: When citing this work, cite the original

More information

Boolean Representations and Combinatorial Equivalence

Boolean Representations and Combinatorial Equivalence Chapter 2 Boolean Representations and Combinatorial Equivalence This chapter introduces different representations of Boolean functions. It then discusses the applications of these representations for proving

More information

Chapter S:II. II. Search Space Representation

Chapter S:II. II. Search Space Representation Chapter S:II II. Search Space Representation Systematic Search Encoding of Problems State-Space Representation Problem-Reduction Representation Choosing a Representation S:II-1 Search Space Representation

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Interactions A link message

Interactions A link message Interactions An interaction is a behavior that is composed of a set of messages exchanged among a set of objects within a context to accomplish a purpose. A message specifies the communication between

More information

Smart Open Services for European Patients. Work Package 3.5 Semantic Services Definition Appendix E - Ontology Specifications

Smart Open Services for European Patients. Work Package 3.5 Semantic Services Definition Appendix E - Ontology Specifications 24Am Smart Open Services for European Patients Open ehealth initiative for a European large scale pilot of Patient Summary and Electronic Prescription Work Package 3.5 Semantic Services Definition Appendix

More information

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology

is easing the creation of new ontologies by promoting the reuse of existing ones and automating, as much as possible, the entire ontology Preface The idea of improving software quality through reuse is not new. After all, if software works and is needed, just reuse it. What is new and evolving is the idea of relative validation through testing

More information

Propositional Logic. Part I

Propositional Logic. Part I Part I Propositional Logic 1 Classical Logic and the Material Conditional 1.1 Introduction 1.1.1 The first purpose of this chapter is to review classical propositional logic, including semantic tableaux.

More information

Information Regarding Program Input.

Information Regarding Program Input. 1 APES documentation (revision date: 17.03.10). Information Regarding Program Input. General Comments Finite element analyses generally require large amounts of input data. The preparation of this data

More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information

CHAPTER-26 Mining Text Databases

CHAPTER-26 Mining Text Databases CHAPTER-26 Mining Text Databases 26.1 Introduction 26.2 Text Data Analysis and Information Retrieval 26.3 Basle Measures for Text Retrieval 26.4 Keyword-Based and Similarity-Based Retrieval 26.5 Other

More information

Evaluation of Predicate Calculus By Arve Meisingset, retired research scientist from Telenor Research Oslo Norway

Evaluation of Predicate Calculus By Arve Meisingset, retired research scientist from Telenor Research Oslo Norway Evaluation of Predicate Calculus By Arve Meisingset, retired research scientist from Telenor Research 31.05.2017 Oslo Norway Predicate Calculus is a calculus on the truth-values of predicates. This usage

More information

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS 82 CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS In recent years, everybody is in thirst of getting information from the internet. Search engines are used to fulfill the need of them. Even though the

More information

Hierarchical Online Mining for Associative Rules

Hierarchical Online Mining for Associative Rules Hierarchical Online Mining for Associative Rules Naresh Jotwani Dhirubhai Ambani Institute of Information & Communication Technology Gandhinagar 382009 INDIA naresh_jotwani@da-iict.org Abstract Mining

More information

jcel: A Modular Rule-based Reasoner

jcel: A Modular Rule-based Reasoner jcel: A Modular Rule-based Reasoner Julian Mendez Theoretical Computer Science, TU Dresden, Germany mendez@tcs.inf.tu-dresden.de Abstract. jcel is a reasoner for the description logic EL + that uses a

More information

Measures of Clustering Quality: A Working Set of Axioms for Clustering

Measures of Clustering Quality: A Working Set of Axioms for Clustering Measures of Clustering Quality: A Working Set of Axioms for Clustering Margareta Ackerman and Shai Ben-David School of Computer Science University of Waterloo, Canada Abstract Aiming towards the development

More information

XML Query (XQuery) Requirements

XML Query (XQuery) Requirements Página 1 de 15 XML Query (XQuery) Requirements W3C Working Draft 12 November 2003 This version: http://www.w3.org/tr/2003/wd-xquery-requirements-20031112 Latest version: http://www.w3.org/tr/xquery-requirements

More information

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS

WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS 1 WEB SEARCH, FILTERING, AND TEXT MINING: TECHNOLOGY FOR A NEW ERA OF INFORMATION ACCESS BRUCE CROFT NSF Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts,

More information

Ontological Modeling: Part 2

Ontological Modeling: Part 2 Ontological Modeling: Part 2 Terry Halpin LogicBlox This is the second in a series of articles on ontology-based approaches to modeling. The main focus is on popular ontology languages proposed for the

More information

8. Relational Calculus (Part II)

8. Relational Calculus (Part II) 8. Relational Calculus (Part II) Relational Calculus, as defined in the previous chapter, provides the theoretical foundations for the design of practical data sub-languages (DSL). In this chapter, we

More information

Concept as a Generalization of Class and Principles of the Concept-Oriented Programming

Concept as a Generalization of Class and Principles of the Concept-Oriented Programming Computer Science Journal of Moldova, vol.13, no.3(39), 2005 Concept as a Generalization of Class and Principles of the Concept-Oriented Programming Alexandr Savinov Abstract In the paper we describe a

More information

WeSeE-Match Results for OEAI 2012

WeSeE-Match Results for OEAI 2012 WeSeE-Match Results for OEAI 2012 Heiko Paulheim Technische Universität Darmstadt paulheim@ke.tu-darmstadt.de Abstract. WeSeE-Match is a simple, element-based ontology matching tool. Its basic technique

More information

MOST attention in the literature of network codes has

MOST attention in the literature of network codes has 3862 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 Efficient Network Code Design for Cyclic Networks Elona Erez, Member, IEEE, and Meir Feder, Fellow, IEEE Abstract This paper introduces

More information

Graph Databases. Guilherme Fetter Damasio. University of Ontario Institute of Technology and IBM Centre for Advanced Studies IBM Corporation

Graph Databases. Guilherme Fetter Damasio. University of Ontario Institute of Technology and IBM Centre for Advanced Studies IBM Corporation Graph Databases Guilherme Fetter Damasio University of Ontario Institute of Technology and IBM Centre for Advanced Studies Outline Introduction Relational Database Graph Database Our Research 2 Introduction

More information

VS 3 : SMT Solvers for Program Verification

VS 3 : SMT Solvers for Program Verification VS 3 : SMT Solvers for Program Verification Saurabh Srivastava 1,, Sumit Gulwani 2, and Jeffrey S. Foster 1 1 University of Maryland, College Park, {saurabhs,jfoster}@cs.umd.edu 2 Microsoft Research, Redmond,

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information

An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information An Evaluation of Geo-Ontology Representation Languages for Supporting Web Retrieval of Geographical Information P. Smart, A.I. Abdelmoty and C.B. Jones School of Computer Science, Cardiff University, Cardiff,

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Text Mining. Representation of Text Documents

Text Mining. Representation of Text Documents Data Mining is typically concerned with the detection of patterns in numeric data, but very often important (e.g., critical to business) information is stored in the form of text. Unlike numeric data,

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

Semantics and Ontologies for Geospatial Information. Dr Kristin Stock

Semantics and Ontologies for Geospatial Information. Dr Kristin Stock Semantics and Ontologies for Geospatial Information Dr Kristin Stock Introduction The study of semantics addresses the issue of what data means, including: 1. The meaning and nature of basic geospatial

More information

Description Logics and OWL

Description Logics and OWL Description Logics and OWL Based on slides from Ian Horrocks University of Manchester (now in Oxford) Where are we? OWL Reasoning DL Extensions Scalability OWL OWL in practice PL/FOL XML RDF(S)/SPARQL

More information

Towards Rule Learning Approaches to Instance-based Ontology Matching

Towards Rule Learning Approaches to Instance-based Ontology Matching Towards Rule Learning Approaches to Instance-based Ontology Matching Frederik Janssen 1, Faraz Fallahi 2 Jan Noessner 3, and Heiko Paulheim 1 1 Knowledge Engineering Group, TU Darmstadt, Hochschulstrasse

More information

0. Database Systems 1.1 Introduction to DBMS Information is one of the most valuable resources in this information age! How do we effectively and efficiently manage this information? - How does Wal-Mart

More information

Symbol Tables Symbol Table: In computer science, a symbol table is a data structure used by a language translator such as a compiler or interpreter, where each identifier in a program's source code is

More information

LOGIC AND DISCRETE MATHEMATICS

LOGIC AND DISCRETE MATHEMATICS LOGIC AND DISCRETE MATHEMATICS A Computer Science Perspective WINFRIED KARL GRASSMANN Department of Computer Science University of Saskatchewan JEAN-PAUL TREMBLAY Department of Computer Science University

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

5 The Control Structure Diagram (CSD)

5 The Control Structure Diagram (CSD) 5 The Control Structure Diagram (CSD) The Control Structure Diagram (CSD) is an algorithmic level diagram intended to improve the comprehensibility of source code by clearly depicting control constructs,

More information

Multi-agent and Semantic Web Systems: RDF Data Structures

Multi-agent and Semantic Web Systems: RDF Data Structures Multi-agent and Semantic Web Systems: RDF Data Structures Fiona McNeill School of Informatics 31st January 2013 Fiona McNeill Multi-agent Semantic Web Systems: RDF Data Structures 31st January 2013 0/25

More information

Supporting Fuzzy Keyword Search in Databases

Supporting Fuzzy Keyword Search in Databases I J C T A, 9(24), 2016, pp. 385-391 International Science Press Supporting Fuzzy Keyword Search in Databases Jayavarthini C.* and Priya S. ABSTRACT An efficient keyword search system computes answers as

More information

Falcon-AO: Aligning Ontologies with Falcon

Falcon-AO: Aligning Ontologies with Falcon Falcon-AO: Aligning Ontologies with Falcon Ningsheng Jian, Wei Hu, Gong Cheng, Yuzhong Qu Department of Computer Science and Engineering Southeast University Nanjing 210096, P. R. China {nsjian, whu, gcheng,

More information

Graphical Notation for Topic Maps (GTM)

Graphical Notation for Topic Maps (GTM) Graphical Notation for Topic Maps (GTM) 2005.11.12 Jaeho Lee University of Seoul jaeho@uos.ac.kr 1 Outline 2 Motivation Requirements for GTM Goals, Scope, Constraints, and Issues Survey on existing approaches

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

6.001 Notes: Section 1.1

6.001 Notes: Section 1.1 6.001 Notes: Section 1.1 Slide 1.1.1 This first thing we need to do is discuss the focus of 6.001. What is this course all about? This seems quite obvious -- this is a course about computer science. But

More information

QUERY EXPANSION USING WORDNET WITH A LOGICAL MODEL OF INFORMATION RETRIEVAL

QUERY EXPANSION USING WORDNET WITH A LOGICAL MODEL OF INFORMATION RETRIEVAL QUERY EXPANSION USING WORDNET WITH A LOGICAL MODEL OF INFORMATION RETRIEVAL David Parapar, Álvaro Barreiro AILab, Department of Computer Science, University of A Coruña, Spain dparapar@udc.es, barreiro@udc.es

More information

ArgQL: A Declarative Language for Querying Argumentative Dialogues

ArgQL: A Declarative Language for Querying Argumentative Dialogues ArgQL: A Declarative Language for Querying Argumentative Dialogues R U L E M L + R R : I N T E R N AT I O N A L J O I N T C O N F E R E N C E O N R U L E S A N D R E A S O N I N G D I M I T R A Z O G R

More information

CPS122 Lecture: From Python to Java last revised January 4, Objectives:

CPS122 Lecture: From Python to Java last revised January 4, Objectives: Objectives: CPS122 Lecture: From Python to Java last revised January 4, 2017 1. To introduce the notion of a compiled language 2. To introduce the notions of data type and a statically typed language 3.

More information

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Jianyong Wang Department of Computer Science and Technology Tsinghua University Jianyong Wang Department of Computer Science and Technology Tsinghua University jianyong@tsinghua.edu.cn Joint work with Wei Shen (Tsinghua), Ping Luo (HP), and Min Wang (HP) Outline Introduction to entity

More information

Optimised Classification for Taxonomic Knowledge Bases

Optimised Classification for Taxonomic Knowledge Bases Optimised Classification for Taxonomic Knowledge Bases Dmitry Tsarkov and Ian Horrocks University of Manchester, Manchester, UK {tsarkov horrocks}@cs.man.ac.uk Abstract Many legacy ontologies are now being

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval (Supplementary Material) Zhou Shuigeng March 23, 2007 Advanced Distributed Computing 1 Text Databases and IR Text databases (document databases) Large collections

More information

Question Answering Systems

Question Answering Systems Question Answering Systems An Introduction Potsdam, Germany, 14 July 2011 Saeedeh Momtazi Information Systems Group Outline 2 1 Introduction Outline 2 1 Introduction 2 History Outline 2 1 Introduction

More information

The Semantic Planetary Data System

The Semantic Planetary Data System The Semantic Planetary Data System J. Steven Hughes 1, Daniel J. Crichton 1, Sean Kelly 1, and Chris Mattmann 1 1 Jet Propulsion Laboratory 4800 Oak Grove Drive Pasadena, CA 91109 USA {steve.hughes, dan.crichton,

More information