Rapid Information Discovery System (RAID)


Int'l Conf. Artificial Intelligence, ICAI'17

B. Gopal, P. Benjamin, and K. Madanagopal
Knowledge Based Systems, Inc. (KBSI), College Station, TX, USA

Summary - This paper describes the motivations, solution concepts, and architecture of a framework for a Rapid Information Discovery System (RAID) that supports semantic enterprise search and knowledge discovery from large volumes of multi-source text data. First, the overall solution concept is summarized and an ontology-driven approach to natural language processing (NLP) is described. Then the RAID architecture for semantic indexing and semantic search is summarized, and the RAID human machine interface and knowledge extraction methods are outlined. Dynamic learning and user feedback-driven adaptation mechanisms are presented next. Finally, conclusions and opportunities for further R&D are given. The concepts described in this paper provide a new approach and solution architecture for iterative and adaptive discovery of information content associated with imprecisely specified descriptions of end user information needs.

Key Words: Semantic Search, Natural Language Processing, Knowledge Discovery, Ontologies, Semantic Technologies

1. Background and Motivations

Information analysts are slowly drowning in a flood of human- and computer-generated information. The constantly increasing volume and velocity of data make it increasingly difficult to identify and use key information, which makes timely and accurate information extraction a very challenging problem. Improvements in semantic modeling, natural language processing, and collaboration technology may provide significant leverage to address this problem.
If information analysts were able to express their information needs in plain terms, understood by a search engine as guidance or examples, documents and other information artifacts might be brought to light that simply guessing at appropriate keywords would never elicit. If a search engine could use examples of sought-after text to find similar documents, analysts could focus their attention on interpreting and using the information rather than on formulating search criteria.

Because information analysts seldom work in isolation, a shared understanding of analysis goals and subsequent sharing of knowledge and effort can significantly improve analytic outcomes. The same technology that would make effective search by example a reality could be used to derive an understanding of analysts' goals directly from their search activities. Such an understanding could be used to connect analysts working on similar problems or to alert analysts to progress made by others in related areas.

Information Retrieval (IR) refers to the activity of obtaining relevant information resources from a collection. Information Extraction (IE) refers to the activity of extracting structured information from an unstructured or semi-structured information resource. In this era of information overload, there is a need to combine the best of both IR and IE to support the timely and accurate discovery of relevant information buried in large volumes of data, regardless of the domain of interest. Mechanisms are also needed that rapidly learn the context and intent of the agents' tasks and progressively enhance the quality of the information discovered to address their evolving information needs.

IR tools typically do not require any customization when applied to new domains, and they provide broad coverage by enabling a user to cast a wide net during the search process.
Although such systems make it easier to uncover new information, they do not support precise information extraction because entities, events, and their relationships are not identified. IE tools, on the other hand, excel at generating structured data from unstructured text and hence at identifying relevant and precise concepts. Unfortunately, IE tools tend to be overly specific to a domain and invariably require much customization to support other domains. According to Etzioni et al. [1], IE has traditionally relied on extensive human involvement in the form of hand-crafted extraction rules or hand-tagged training examples. Because IE tools have narrower coverage, it is harder to stumble upon new information with them.

There is a need to combine the best of both the IR and IE spectrums to support the discovery of information in large corpora in any domain. Adequate semantic tagging and analysis methods are needed that intelligently find useful nuggets of information in text corpora through natural language-based semantic analysis of informal end user descriptions or queries.

Other research initiatives in semantic search and ontology-based querying address some aspects of the problem targeted by the RAID framework. Representative examples of related research include Simple HTML Ontology Extension (SHOE) [2], TAP [3], Intelligent Semantic Web Retrieval Agent (ISRA) [4], Semantic Content Organization and Retrieval Engine (SCORE) [5], Unsupervised Learning of Semantic Relationships [6], and Ontology-based Information Extraction [7][8]. The main limitations of these approaches include (i) inability to adequately capture end users' search context, (ii) inability to address the dynamic nature of the ontologies used for search, and (iii) limited learning and adaptation abilities.

The RAID framework described in this paper is intended as a comprehensive solution to these semantic search challenges and research gaps. RAID supports the discovery of relevant information across large volumes of data. Moreover, the RAID approach provides mechanisms to facilitate the modeling of user information needs while using learning mechanisms to progressively improve the search results based on the users' interaction with the system.

2. The RAID Solution Overview

RAID provides a web-based enterprise search capability that enables focused, high-precision semantic search over disparate data sources using various types of user inputs and/or user-defined ontologies. The RAID approach applies ontology-driven text analytics and NLP methods to extract and discover knowledge from collections of structured and unstructured data sources.
The RAID solution differs from other search technologies in two important ways:

(1) Support for both structured and unstructured data: Typically, enterprises have access to both unstructured data on a file system and structured data in databases. RAID can process both types of data and hence supports searching across various data sources. The application can index and search the textual contents of SQL Server and Oracle databases in addition to documents on a file system.

(2) Support for a variety of rich input types: RAID accepts a number of different user input types beyond basic keywords. The more input the user provides, the more accurate the query formulation process becomes and the more semantic content the query builder can extract. Consequently, the search results will match the user's search tasks more accurately. Specifically, the supported input types include the following:

Keywords: Similar to many search engines, RAID allows the user to directly input the individual terms he or she is looking for. The user also has the option of assigning a weight to each term and of indicating whether the term must, should, or must not occur in the target data. The user can also specify exact phrase matching.

Example Text: The user may also provide example text that discusses, in natural language, the concepts for which the user is searching. This text may be as short as a single sentence or as long as an entire document or an entire directory of documents. The example text is analyzed by the RAID query builder, which generates a weighted list of important tokens to search for.

Ontologies: An ontology captures important concepts and relationships relevant to the domain of the user's search task. In RAID, a user can enhance the search process through the use of a specific ontology, which identifies terms of particular interest within a domain.
Ontology models often provide information that associates context with a specific search; they can be used to disambiguate terms and to provide background knowledge that might help in interpreting content. For example, the term "launch" would have a different meaning to a music executive, a rocket scientist, and a web entrepreneur. Using an ontology model to assist the search affects both the content and the weighting of the term list generated by the query builder. Users can specify an ontology through a Controlled Natural Language (CNL) interface, which eliminates the need to understand advanced ontology modeling concepts.

Acronym Lists and Glossaries: Finally, the user can supply a set of acronym lists and/or glossaries to augment the search. Both of these inputs provide additional domain information about specific terms that may appear in the search inputs or in the target data. The query builder uses this information to augment the weights and content of the term list it generates.

3. The Ontology-Driven Text Analytics Solution Concept


A distinguishing aspect of RAID is the use of an ontology-driven approach to text processing. First, we outline the upper level ontology used by RAID. Figure 1 provides an overview of the scope of information contained in the core Ontological Semantics Resource (OSR) upper ontology. The OSR was designed as a resource for deep natural language understanding. As such, it contains approximately 27,000 lexical items (to process the information as displayed) and 5,000 concepts (to model the information conveyed).

The data model in Figure 1 shows that each lexeme connects to at least one word use sense. Each word use sense may have both syntactic and semantic constraints that govern its correct occurrence (e.g., a joke can bomb and a house can be bombed, but a house can't bomb). Each word sense has one concept label that represents a concept within the OSR ontology. The Concept component of the OSR is the meaning model (ontology) of the OSR. Concepts from the OSR ontology can be event, object, or property types. When a lexeme maps to multiple word senses, the syntactic (SynStruct) and semantic (SemStruct) constraints are used to disambiguate and select the proper word sense assignment.

Figure 1. OSR Ontology Overview

4. RAID NLP

RAID uses KBSI's Natural Language Processing Pipeline (KNLP), an ontology-driven semantic information extraction module that is designed to process unstructured data (Figure 2). We use the term "unstructured" to refer to text that is in a natural language form that conforms closely to the rules of English grammar.

Figure 2. The RAID KNLP Pipeline

As shown in Figure 2, the KNLP comprises eight stages. Each block in the pipeline is labeled according to the set of tags that are added to the input text after it has passed through the block. The following list describes the functional blocks.

Sentence Boundary Detection (SBD): This module splits the input text into sentences.
Tokenization (TKN): This module splits each sentence into a set of tokens.

Named Entity Recognition Level 1 (NER1): This module consists of two algorithms. The first algorithm classifies sets of adjacent tokens as PEOPLE, ORGANIZATIONS, and LOCATIONS when appropriate. The second algorithm uses a set of regular expressions to recognize a wide array of entity types such as part numbers, system identifiers, MGRS coordinates, and addresses, to name a few. The set of entity types for this second algorithm is expandable based on the requirements of the input text. Named entity recognition is the first step in transforming input text into a semantic representation.

Part of Speech Tagging (POS): This module labels the individual tokens with their corresponding part of speech tags.

Phrase Chunking (CNK): This module groups adjacent tokens into phrases such as noun and verb phrases.

Named Entity Recognition Level 2 (NER2): This module is an additional stage of named entity recognition that makes use of the phrase boundaries calculated at the CNK level.

Subject-Verb-Object, Clause Identification (SVO): This module identifies the subject, verb, and object within a clause. It also identifies the boundaries between clauses.

Semantic Analyzer (SEM): This module maps noun phrases and verb phrases into the concepts within the OSR ontology. The SEM component provides concept labels for each identified noun and verb phrase within a sentence using different semantic analysis techniques. The SEM module depends on the output from all previous stages to perform its processing. It also depends heavily on the OSR ontology to provide the syntactic and semantic constraints used in the semantic analysis calculations. The main function of the SEM component is to


perform semantic disambiguation at varying levels of fidelity.

5. Semantic Indexing and Search

The RAID semantic indexing architecture is shown in Figure 3 and the search architecture is shown in Figure 4.

Figure 3. RAID Semantic Indexing

The main goal of the semantic indexing component is to generate a Lucene index and a triple store from the corpus of interest. The RAID index is generated by the Term Indexer component, which indexes every term in the text that the Text Extractor extracts from the documents. The RAID triple store is the central location where RDF triples are stored and is generated by the Triple Indexer. The Triple Indexer processes the output of the KNLP Pipeline and extracts semantic content from the document corpus in the form of RDF. The triple store stores and maintains knowledge from three different sources: (i) knowledge extracted from the document corpus during indexing, (ii) knowledge supplied by the analysts in the form of ontology input and the OSR, and (iii) knowledge inferred by the RAID system based on user feedback.

Figure 4. RAID Search Architecture

The RAID search architecture is summarized as follows. During a search, a user may provide the following inputs: keywords, example text, ontologies, acronyms, and glossaries. After these inputs are processed through the KNLP Pipeline, the output is fed to two different pipelines. The first pipeline generates a weighted search query through the Query Builder, and the query is used to search the RAID index. This produces one set of search results. The second pipeline involves a SPARQL Builder and a SPARQL Processor. The SPARQL Builder component analyzes the different types of user inputs (example text, keywords, ontologies, acronym lists, and glossaries) and uses the results to generate a SPARQL query that asks for the existence of certain concepts and relationships.
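To make the second pipeline more concrete, the following is a minimal sketch of the kind of concept query the SPARQL Builder might generate, together with a plain-Python imitation of how a processor could answer it over a handful of triples. It reuses the facts from the paper's No-dong missile example. The `ex:` vocabulary, the predicate names, the extra unowned missile, and the helper functions are all assumptions made for illustration; the paper does not specify RAID's actual schema or query templates, and a real deployment would hand the query to a SPARQL processor rather than to code like this.

```python
# Hypothetical vocabulary; RAID's actual predicates and URIs are not given in the paper.
TYPE, SUBCLASS_OF, OWNED_BY = "rdf:type", "rdfs:subClassOf", "ex:ownedBy"

# A SPARQL 1.1 query with the same intent (shown for reference, not executed here).
SPARQL_QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>
SELECT ?x WHERE {
  ?x rdf:type/rdfs:subClassOf* ex:Weapon .
  ?x ex:ownedBy ex:NorthKorea .
}
"""

triples = {
    ("ex:No-dong", TYPE, "ex:Missile"),          # extracted during indexing
    ("ex:No-dong", OWNED_BY, "ex:NorthKorea"),   # extracted during indexing
    ("ex:Missile", SUBCLASS_OF, "ex:Weapon"),    # from a user-supplied ontology
    ("ex:OtherMissile", TYPE, "ex:Missile"),     # invented fact with no owner, to show filtering
}


def superclasses(cls, store):
    """Return cls plus every class reachable from it via subClassOf."""
    seen, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in store:
            if s == current and p == SUBCLASS_OF and o not in seen:
                seen.add(o)
                frontier.append(o)
    return seen


def instances_owned_by(target_class, owner, store):
    """Find instances of target_class (or a subclass) that are owned by owner."""
    results = set()
    for s, p, o in store:
        if p == TYPE and target_class in superclasses(o, store):
            if (s, OWNED_BY, owner) in store:
                results.add(s)
    return sorted(results)


matches = instances_owned_by("ex:Weapon", "ex:NorthKorea", triples)
print(matches)  # ['ex:No-dong']
```

The subclass walk is what lets a fact stated about missiles answer a question asked about weapons; the concepts returned would then be passed on as search terms against the index.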
The SPARQL Processor component serves as the interface to the triple store: it processes the SPARQL queries, makes inferences from the domain knowledge coded in the triple store, and returns results. A number of third-party SPARQL processors are available, such as Jena [9]. Processing the SPARQL query produces a second set of search results. The two sets of search results generated by the two separate pipelines are then merged and re-ranked by the Search Results Ranker. After viewing the search results, the user may provide feedback about each result, which is then persisted in the triple store.

With SPARQL, the RAID system asks for the existence of ontological concepts that match specific criteria. The result of the query is a list of the concepts in the triple store that match the criteria. These specific concepts can then be used as search criteria in the index, and the documents containing those concepts are returned as search results. For example, a SPARQL query may essentially ask the question "Are there any instances of weapons owned by North Korea in the RDF store?" During indexing, the semantic analyzer may have extracted the following knowledge: (i) No-dong is a type of missile and (ii) No-dong is owned by North Korea. Additionally, through user-supplied ontologies, the RDF triple store may also contain the following item of knowledge: a missile is a type of weapon. With this knowledge present in the RDF triple store, the SPARQL query described above would result in No-dong being returned. This term could then be


supplied to the RAID semantic query builder and searched for in the index, resulting in a list of documents containing the concept No-dong. In addition to individual concepts, this approach would also work for searching for relationships (e.g., "Find documents that assert that X is owned by Y, where X is a weapon").

6. Dynamic Human Machine Interface Mechanisms

The RAID framework provides intuitive Human Machine Interface (HMI) mechanisms to capture the user's queries as accurately and interactively as possible. The user interfaces are intended to augment and complement the ontology-driven text analytics algorithms. The solution architecture provides intelligent, intuitive, and interactive user interface mechanisms that assist the end user by supporting multiple steps in the knowledge discovery process. The idea is to let the user influence the query expansion and term weighting steps before these tasks are executed.

Figure 5 shows an example of a user interface that allows data exploration in an intuitive manner. The user, who has no idea what the dataset contains, types the query "crimes in Mexico" into the RAID application. The semantic tagging capability generates the simple ontology shown below the query: "crimes" is tagged as CRIMINAL_ACTIVITY and is connected to Mexico by the HAS_LOCATION relationship. RAID allows the user to change the semantic tags in case of incorrect tagging. At this point, the user can expand, collapse, delete, or even manually add a concept. By selecting the concept CRIMINAL_ACTIVITY and choosing the option "subclasses", the user is presented with its subclasses. The user narrows the search down to one specific type of CRIMINAL_ACTIVITY, for example DRUG_TRAFFICKING, by deleting the other subclasses. Further expansion of the DRUG_TRAFFICKING concept reveals the ontology structure around that concept.
At this point, if the user executes a search, the results shown on the far right of Figure 5 are returned. Notice that after this exploratory phase, results that discuss marijuana, heroin, and cocaine in Mexico are now returned.

Figure 5. Dynamic Query Building in RAID

All of this interaction is stored and used later by the feedback and learning mechanism. RAID can incrementally save user interaction information, resulting in more accurate user modeling. The information saved includes the user interaction history, the current query, past queries in the same search session, and past queries in the entire search history. The saved information is later used by RAID to inform the learning and adaptation mechanisms.

7. RAID Knowledge Extraction

RAID provides the ability to extract entities, relationships, and events from text data. The extracted entities and events are mapped to concepts in the OSR ontology. The mapping of unstructured text into the ontology of the OSR is performed by the KNLP Pipeline's Semantic Analysis component. The RAID approach to unsupervised relationship extraction builds upon the approaches of Bollegala et al. [6] and Hasegawa et al. [10]. The architecture of the Relationship Extraction pipeline is shown in Figure 6.

Figure 6. Relationship Extraction Pipeline

Instance Generator: This module is responsible for extracting training instances from the supplied data. It does not require any input from the user to function. The module performs a single pass over the data and extracts instances of the form {NP1}

<text> {NP2} from the text. Here NP1 and NP2 form an entity pair; each can be a noun phrase or a named entity. NP1 and NP2 can also be extracted based on syntactic guidelines, such as requiring that their head nouns begin with an uppercase letter. <text> represents the text between the two noun phrases and is called the context. For the example input "The market closed high because of the news of Google hiring Dr. William", the instance generator would output {Google} <hiring> {Dr. William}. Our approach combines several shallow linguistic features with a set of deep semantic features extracted by the KNLP Pipeline. The KNLP Pipeline extracts named entities along with their respective types and OSR tags. These extra features aid the entity-pair clustering at a later stage of the text processing pipeline. The output of the Instance Generator is a context matrix. Each row in the matrix represents an entity pair, and each column represents a context. For row i and column j, the cell holds the number of times entity pair i occurred with context j.

Clustering: Each row in the context matrix represents a context vector for an entity pair. This module performs clustering on all the row vectors. Entity pairs that occur in similar types of relationships are clustered together. These clusters are used to train a machine learning classifier.

Classifier: This module uses the generated clusters to train a machine learning classifier. Each entity pair in a cluster is assigned a label y_k, where y_k is the id of the cluster to which the entity pair belongs. The goal of the classifier is to learn P(y_k | e_i), where e_i is the context vector of an entity pair.

Runtime tagging: Once the model is learned, new entity pairs are extracted and classified without any input from the user.
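The stages above can be sketched end to end on a toy scale. The code below builds a context matrix from already-extracted {NP1} <text> {NP2} instances, groups entity pairs whose context vectors are similar, and labels a new pair by its closest cluster. The paper does not name the clustering algorithm or the classifier, so the greedy cosine-threshold clustering and the nearest-centroid rule used here are stand-ins chosen for brevity, not RAID's method. All instances other than the Google example are invented, and the noun-phrase extraction that KNLP would perform is assumed to have happened already.

```python
from collections import Counter, defaultdict
from math import sqrt


def build_context_matrix(instances):
    """Rows are entity pairs, columns are contexts, cells are co-occurrence counts."""
    matrix = defaultdict(Counter)
    for np1, context, np2 in instances:
        matrix[(np1, np2)][context] += 1
    return dict(matrix)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster_rows(matrix, threshold=0.5):
    """Greedy clustering: a row joins the first cluster whose seed row is similar enough."""
    clusters = []  # each cluster is a list of entity pairs; the first one is the seed
    for pair, row in matrix.items():
        for members in clusters:
            if cosine(row, matrix[members[0]]) >= threshold:
                members.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters


def train_centroids(matrix, clusters):
    """Summarize each cluster by the sum of its members' context vectors."""
    centroids = []
    for members in clusters:
        total = Counter()
        for pair in members:
            total.update(matrix[pair])
        centroids.append(total)
    return centroids


def classify(context_vector, centroids):
    """Return the id of the most similar cluster (a stand-in for argmax P(y_k | e_i))."""
    scores = [cosine(context_vector, c) for c in centroids]
    return max(range(len(scores)), key=scores.__getitem__)


# Invented training instances, apart from the paper's Google example.
instances = [
    ("Google", "hiring", "Dr. William"),
    ("Acme", "hiring", "J. Doe"),
    ("Acme", "recruited", "J. Doe"),
    ("Google", "recruited", "Dr. William"),
    ("Paris", "is the capital of", "France"),
    ("Tokyo", "is the capital of", "Japan"),
]

matrix = build_context_matrix(instances)
clusters = cluster_rows(matrix)
centroids = train_centroids(matrix, clusters)

# Runtime tagging of an unseen pair whose only observed context is "hiring".
label = classify(Counter({"hiring": 1}), centroids)
print(clusters)
print(label)  # 0, the employment-like cluster
```

Because the cluster ids carry no names of their own, a practical system would still need a step that attaches a readable relationship label to each cluster.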
At runtime, all noun phrases and their context vectors are extracted as described for the Instance Generator. The clustering module is bypassed, and the instances in the instance matrix are fed directly into the classifier.

8. RAID Dynamic Improvement from User Feedback

Analysts can provide feedback to the RAID system, both implicitly and explicitly, from the first step of query building all the way to the review of search results. RAID processes this feedback to enhance the quality of queries and improve the results of subsequent searches. Additionally, the system automatically detects similarities between different analysts' search tasks and uses the feedback and results found by one analyst to improve the results of another.

Implicit feedback

All queries issued in the tool are indexed using a technique very similar to the one RAID uses to index its corpus for semantic search. Subsequently, when a new query is generated, it is tokenized and weighted, and the resulting weighted term vector is used to search the index of all previous queries. The search results are similar previously run queries from both the same user and other collaborators. This capability can support (i) query completion, by providing assistance while a user enters a query, and (ii) query suggestion, by suggesting useful related queries from the same user or from collaborators.

Explicit feedback

In IR systems it is important to make use of the user's judgment on previous searches to continuously learn and enhance the performance of the system. Currently, the RAID tool captures user feedback through unobtrusive links in the user interface. When a search is performed, each result shows three options beside it that the user can click:

Thumbs Up. Clicking this link marks the search result as relevant to the current search task.

Thumbs Down. Clicking this link marks the search result as irrelevant to the current search task.
Save. Clicking this link saves the search result to the RAID database, associated with the user's current search task. This enables the user to revisit search results at another time and to share results with other analysts. Any search result that is saved is also marked as relevant to the current search task.

Figure 7. Process of Refining the Query through Relevance Feedback
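The feedback options above give RAID, for each search task, a set of results marked relevant and a set marked irrelevant. The sketch below shows how such judgments could drive the Rocchio update that this section goes on to describe, in which the new query is a times the original query, plus b times the sum of the relevant document vectors, minus c times the sum of the non-relevant ones. Queries and documents are represented as sparse term-weight dictionaries. The weight values, the sample vectors, and the decision to drop terms whose weight falls to zero or below are illustrative assumptions, not values taken from RAID.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, a=1.0, b=0.75, c=0.25):
    """Return a*query + b*sum(relevant) - c*sum(non_relevant) as a sparse vector."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += a * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += b * weight
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= c * weight
    # Assumption: terms pushed to zero or below are dropped from the new query.
    return {term: weight for term, weight in updated.items() if weight > 0}


# Invented example: one thumbs-up (or saved) result and one thumbs-down result.
query = {"crime": 1.0, "mexico": 1.0}
relevant = [{"crime": 1.0, "mexico": 1.0, "cocaine": 2.0}]
non_relevant = [{"mexico": 1.0, "tourism": 2.0}]

new_query = rocchio(query, relevant, non_relevant)
print(new_query)  # {'crime': 1.75, 'mexico': 1.5, 'cocaine': 1.5}
```

This follows the summed form of the update. Many textbook presentations instead average over each set by dividing the two sums by the number of relevant and non-relevant documents, which keeps the update from growing with the amount of feedback collected.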

As shown in Figure 7, the main idea behind relevance feedback is to perform an initial query, incorporate feedback from previous searches regarding which documents are relevant and which are irrelevant, and then augment the initial query by adding, removing, and re-weighting terms. Using the Rocchio Vector Space Relevance Feedback algorithm, the initial query is modified after the relevance feedback as follows:

$$Q' = aQ + b \sum_{r \in R} r - c \sum_{s \in S} s$$

where
Q: original query vector
R: set of relevant document vectors
S: set of non-relevant document vectors
a, b, c: constants (Rocchio weights)
Q': new query vector

The effect of modifying the initial query is to bias the query towards more relevant documents in the document vector space.

9. Conclusions and Opportunities for Further R&D

This paper described a solution architecture for iterative and adaptive discovery of information content associated with imprecisely specified descriptions of end user information needs. The RAID solution provides several benefits, including (i) significant and measurable gains in the precision and recall of searches performed by information analysts, resulting in a significant reduction in the time required to discover relevant information; (ii) a significant increase in information sharing among analysts, reducing redundant search efforts and increasing the overall quality of the information discovered by collaborating analysts; and (iii) significant increases in the precision and recall of analysts' searches over time as the RAID system better learns the search tasks of users through both explicit and implicit user feedback.
Areas that would benefit from further R&D include (i) enhanced scalability of semantic search through the use of big data technologies; (ii) enhanced abilities to dynamically update and revise the ontology models that drive the semantic information extraction engine; and (iii) enhanced capabilities to better use ontologies and automated reasoning techniques to improve semantic search and knowledge discovery.

10. References

[1] Etzioni, O., Banko, M., Soderland, S., & Weld, D. Open Information Extraction from the Web. Communications of the ACM, Vol. 51, No. 12, December
[2] Heflin, J. and Hendler, J. Searching the web with SHOE, AAAI Workshop, WS-00-01, AAAI Press, Menlo Park, CA, pp. 35-40,
[3] Guha, R., McCool, R. and Miller, E. Semantic search, WWW 03: Proc. of the Twelfth Int. Conf. on WWW, May, Budapest, Hungary,
[4] Burton-Jones, A., Storey, V.C., Sugumaran, V. and Purao, S. A heuristic-based methodology for semantic augmentation of user queries on the web, 22nd Int. Conf. on Conceptual Modeling, Chicago, IL, USA, Oct. 13-16, Proceedings, pp ,
[5] Sheth, A., Bertram, C., Avant, D., et al. Managing semantic content for the web, IEEE Internet Computing, Vol. 6, No. 4, pp. 80-87,
[6] Bollegala, D., Matsuo, Y., & Ishizuka, M. Relational Duality: Unsupervised Extraction of Semantic Relations between Entities on the Web. WWW 2010, Raleigh, NC, April 26-30,
[7] Dou, D., Wang, H., & Liu, H. Semantic data mining: A survey of ontology-based approaches, Semantic Computing (ICSC), 2015 IEEE International Conference on, IEEE (2015), pp
[8] Shah, R. and Jain, S. Ontology-based Information Extraction: An Overview and a Study of different Approaches, International Journal of Computer Applications 87(4):6-8, February
[9]
[10] Hasegawa, T., Sekine, S., & Grishman, R. Discovering relations among named entities from large corpora. ACL '04 Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics
