User Configurable Semantic Natural Language Processing

Jason Hedges, CEO and Founder
Edgetide LLC
info@edgetide.com
(443) 616-4941

Table of Contents

Bridging the Gap between Human and Machine Language
Human and Machine Languages
Ordenite: The Missing Link Between Human and Machine Language
What is Semantic Natural Language Processing?
How Ordenite Works
User Configuration
Lenses
Entities and the Entity Ontology
Activities and the Activity Ontology
Building the Graph Data Structure
Ordenite Implementations
Question Answer System
Populating a Triple Store
Data Mining
Facet Generation
Data for Machine Analysis

Bridging the Gap between Human and Machine Language

It is estimated that unstructured information accounts for 70-90% of the data within most organizations. As computer systems advance, so too does the amount of unstructured data within the digital world. Yet despite unstructured text making up the overwhelming majority of enterprise data, there are few tools that allow a computer system to develop a deep understanding of what the text describes.

Human and Machine Languages

Human languages describe entities and activities and their relationships to each other. Whether someone is describing a complex scientific reaction between particles or the latest blockbuster movie, they are describing entities and activities: things, and things that are happening. This is how humans experience the world, through objects (including intangible ones) and events.

Machine languages describe logic, processes, and algorithms. Computer systems excel with structured data, which they can readily use within programs, feed to statistical models, search and discover, and display to a user in a variety of formats. However, much of the data that humans create is unstructured. This creates a gap between the majority of data and the kind of data a computer system excels with.

Ordenite: The Missing Link Between Human and Machine Language

Edgetide spent several years researching and prototyping tools that would allow a system to obtain a deeper understanding of unstructured text. We determined that, to bridge the gap between human and machine, a highly configurable, topic-based entity and activity extraction system is required, because all human languages describe entities and activities. The system would also need to understand the relationships between the entities and the activities. Finally, the system would need to convert the unstructured text into a data structure that a computer system could easily understand, without losing any meaning from the text in the conversion.

What is Semantic Natural Language Processing?

Semantic Natural Language Processing (NLP) is the ability to capture the meaning of unstructured text in a way that a computer system can understand and fully take advantage of. Ordenite offers a highly configurable Semantic NLP extraction platform that orders and unites unstructured data by determining the semantic meaning of text and building linked, node-graph-based data structures from the content. These data structures enable computer systems to query and analyze unstructured content. Ordenite goes beyond typical syntactic comparison of words to interpret the meaning of statements. Ordenite's design allows users to extract objects and graphs by configuring or extending user-defined lenses without software code or statistical training, thus providing a machine-friendly format that captures the meaning of the text within the perspective of the configured topic area.

How Ordenite Works

Ordenite's patent-pending methods and algorithms empower an organization to unlock its unstructured content for machine evaluation, search, and analysis. Ordenite ships with certain lenses (configurations for specific topics) for extraction, but a great advantage for the customer is the ability to modify them or create new ones. In this section we'll briefly walk through how Ordenite's configuration works and how it relates to text extraction.

User Configuration

One of Ordenite's most versatile features is the ability to create new configurations based on an area of interest. We call these lenses because they offer a different view of the data, specific to the scope of the desired topic area. Each lens consists of an ontology for entities and another for activities. Ordenite has an easy-to-use web interface for creating new lenses or modifying existing ones. A graphical web interface (as shown in Figure 1) makes it easy to create and modify the ontologies for activities and entities. Users can also attach rules and operations to each entity or activity directly in the web interface.

Figure 1: Entity Ontology with the Human entity highlighted
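Although lenses are normally created and edited through the web interface, it can help to picture what a lens bundles together. The sketch below is a deliberately simplified, hypothetical illustration in Python; the structure and names are invented and do not reflect Ordenite's actual configuration format.

    # Hypothetical, highly simplified sketch of what a lens bundles together.
    # Names and structure are invented for illustration only.
    example_lens = {
        "name": "TerrorismEvents",                   # invented topic area
        "entity_ontology": {                         # "things"
            "Human":    {"parent": None, "rules": []},
            "Building": {"parent": None, "rules": []},
        },
        "activity_ontology": {                       # "things that happen"
            "Attack": {"parent": None, "rules": []},
        },
    }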

Lenses

Ordenite divides different extraction configurations by user-configurable topics of interest. We call these lenses. A lens is a configuration for a specific subject of interest. User-defined lenses allow Ordenite to provide multiple perspectives on a given corpus of input. This means that different users can interpret content based on their specific interests, which enhances flexibility. Lens topics can range from sports to terrorism to finance and can be as specific or generic as needed. A lens is comprised of an entity ontology and an activity ontology; the two ontologies represent things and things that happen within a subject of interest. Users can create multiple lenses and even derive new lenses from existing ones.

Entities and the Entity Ontology

An entity is some sort of distinct and independent thing. As mentioned above, Ordenite's purpose is to be a highly configurable system that recognizes activities and entities and how they relate to each other. Many NLP products and tools include entities in some form. Ordenite takes the idea of entities much further than most, because it allows you to relate entities to their attributes and also to associated activities. Making it even more practical for custom use, Ordenite gives control of the configuration of the entities to the user per topic area. These configurations are the lenses described in greater detail above.

Entities can contain a set of attributes and also inherit the attributes of their parents. An attribute's value can be a portion of the entity's text or another entity altogether. For example, consider "The red car was parked at the store." The entity Vehicle could have the attribute color, which in this sentence would be "red". The ability to attach attributes to entities is important so that the meaning of the text is maintained when translated to a graph data structure.

There are many advantages to using an ontology to configure entities for a lens. An ontology is a method of modeling knowledge around a domain. Specifying the relationships between entities is valuable for recognition and graph data construction. In addition, entities can inherit from parent entities within the ontology. Inheriting a parent's attributes can greatly reduce the configuration needed for an entity. For example, consider the previous example sentence, "The red car was parked at the store." If we had a parent entity named Tangible with a definition for the attribute color, the entity Vehicle could inherit the attribute definition since it would be a child of Tangible. The real advantage is that any other entity that also inherits from Tangible would inherit the same definition.
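To make the inheritance behavior concrete, here is a minimal sketch, assuming a simple parent/child dictionary structure, of how an attribute defined on Tangible could propagate to Vehicle. The representation is invented for illustration and is not Ordenite's internal model.

    # Hypothetical entity ontology fragment: children inherit attribute
    # definitions from their parents (names invented for illustration).
    entity_ontology = {
        "Tangible": {"parent": None,       "attributes": {"color"}},
        "Vehicle":  {"parent": "Tangible", "attributes": {"make", "model"}},
        "Building": {"parent": "Tangible", "attributes": {"address"}},
    }

    def effective_attributes(ontology, entity):
        """Collect an entity's own attributes plus everything inherited from its parents."""
        attrs = set()
        while entity is not None:
            node = ontology[entity]
            attrs |= node["attributes"]
            entity = node["parent"]
        return attrs

    # Vehicle picks up "color" from Tangible (set print order may vary),
    # so "The red car was parked at the store" can yield color = "red".
    print(effective_attributes(entity_ontology, "Vehicle"))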

Activities and the Activity Ontology

The section above describes how Ordenite extracts things; activities are how Ordenite extracts and understands things that happen. Ordenite uses fully configurable activity ontologies in which rules can be attached to each activity. The activity ontology follows a hierarchical structure that allows children to inherit rules and attributes from parent objects, greatly reducing the amount of configuration for subtypes.

Ordenite uses a combination of lexical items and attribute rules to determine the semantic meaning of a statement. To better explain how this works, consider the three following simple example sentences:

1) Sally Smith made Joy walk to the park.
2) Sally Smith made Joy some cookies.
3) Sally Smith made Joy happy.

Each of the three sentences above contains the same lexical item, the verb "to make". However, each sentence has a very different meaning. Humans can determine the difference in meaning because of context. Ordenite can do the same. The activity Force Person is identified in the first sentence. Its configuration specifies four rules that must be met:

- "to make" or "to force" as the lexical item
- a Human entity in the subject position of the statement as the Actor
- a Human entity in the object position of the statement as the Affected
- a verb phrase with the Affected as its subject, captured as the Action attribute

Ordenite can determine the position of a word in relation to the lexical item. A word or entity in the subject position is what is performing the lexical item; in the example, Sally Smith is the entity performing the "to make". The object position holds the word or entity that the lexical item is affecting; in the example, Joy is the entity affected by the lexical item. Ordenite can determine the correct position regardless of the many ways a statement can be constructed.
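As a rough, hypothetical illustration of how such rules could be represented and checked, the sketch below encodes the four Force Person conditions and tests them against a toy parse of sentence 1. The data structures are invented for explanation and do not reflect Ordenite's implementation.

    # Hypothetical encoding of the "Force Person" activity rules described above.
    # The parsed-statement structure is invented for illustration only.
    force_person_rules = {
        "lexical_items": {"make", "force"},          # "to make" or "to force"
        "subject_entity": "Human",                   # Actor
        "object_entity": "Human",                    # Affected
        "requires_verb_phrase_on_object": True,      # Action attribute
    }

    # A toy parse of "Sally Smith made Joy walk to the park".
    statement = {
        "lexical_item": "make",
        "subject": {"text": "Sally Smith", "entity": "Human"},
        "object": {"text": "Joy", "entity": "Human"},
        "object_verb_phrase": "walk to the park",
    }

    def matches_force_person(stmt, rules):
        return (stmt["lexical_item"] in rules["lexical_items"]
                and stmt["subject"]["entity"] == rules["subject_entity"]
                and stmt["object"]["entity"] == rules["object_entity"]
                and (stmt.get("object_verb_phrase") is not None
                     or not rules["requires_verb_phrase_on_object"]))

    print(matches_force_person(statement, force_person_rules))  # True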

Figure 2: Ordenite graph data output of "Sally Smith made Joy walk to the park"

Building the Graph Data Structure

The building blocks of a Semantic Web graph are triples, which consist of a subject, a predicate, and an object. To build a graph data structure, the entities and activities are first extracted from the text. Once entities are populated from the rules found within the lens's entity ontology, triples are constructed from the related attributes. Similarly, once activities have been extracted from the text, they are converted into triples: the name of the activity is the subject, the name of the attribute is the predicate, and the value of the attribute is the object. The value of an activity attribute is most often an entity, which enables the activities and entities to be related to one another. When the triples are merged, they form a group of interconnected nodes, or graph data structure. Ordenite generates this graph data structure automatically from unstructured text based on the lens configuration. The graph can be output in open standard formats like RDF or N-Quads, or as JSON for easier integration into software.
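For illustration only, the sketch below uses the open source rdflib library to build triples along these lines for the activity extracted from sentence 1. The namespace and predicate names are assumptions, not Ordenite's actual vocabulary.

    # Rough illustration of turning an extracted activity into triples with rdflib.
    # The namespace and predicate names are invented for this example.
    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/lens/")   # hypothetical lens namespace
    g = Graph()

    activity = EX["ForcePerson_1"]               # the activity node is the subject
    g.add((activity, EX.actor,    EX["Sally_Smith"]))      # attribute name -> predicate
    g.add((activity, EX.affected, EX["Joy"]))              # attribute value is another entity
    g.add((activity, EX.action,   Literal("walk to the park")))

    print(g.serialize(format="turtle"))          # could also be N-Triples, JSON-LD, etc.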

Figure 3: Graph data visualization for multiple terrorism narratives

Ordenite Implementations

Ordenite has been used in a wide variety of implementations. It is based on open standards so that it can be integrated quickly and easily into existing enterprise architectures with minimal integration effort. Some interesting uses of Ordenite are highlighted below.

Question Answer System

Ordenite was used to create a system that allows a user to type a what, where, or when question and receive an answer along with a snippet of the original text for reference. As shown in the sections above, Ordenite can convert unstructured text into a graph data structure based on a lens configuration. Ordenite can also convert a human-language question into a graph data query for a specific topic area (lens). To conform with open standards, Ordenite uses SPARQL as the query syntax. The ability to convert a question to SPARQL lets users perform complex graph queries without needing to know the syntax or even the data store's ontology. While only SPARQL is supported directly, an API for question conversion exists to extract query parameters for use with other query syntaxes.
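Purely as an illustration, a what-question like the one shown in Figure 4 might translate into a SPARQL query of roughly this shape; the vocabulary here is hypothetical and does not correspond to any particular lens ontology.

    # Hypothetical SPARQL that a question like "What buildings were damaged
    # from dynamite?" might be converted to; vocabulary invented for illustration.
    question_as_sparql = """
    PREFIX ex: <http://example.org/lens/>

    SELECT ?building ?snippet WHERE {
        ?damage  a             ex:DamageProperty ;   # the extracted activity
                 ex:affected    ?building ;          # what was damaged
                 ex:instrument  ex:Dynamite ;        # how it was damaged
                 ex:sourceText  ?snippet .           # snippet of the original text
        ?building a ex:Building .
    }
    """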

Figure 4: Screenshot of the question answer system with the question "What buildings were damaged from dynamite"

Populating a Triple Store

As mentioned throughout this document, Ordenite outputs a graph data structure based on a user-defined lens configuration. The graph data structure is output as JSON, RDF, or N-Quads, and RDF and N-Quads can be inserted directly into most triple stores. In the figure below, the open source triple store Sesame is shown with data ingested from Ordenite. Ordenite was used to create graph data structures from tens of thousands of narratives describing terrorism events. The graph was output as N-Quads so that each triple plus its context could be inserted into Sesame. Ordenite has been used to ingest unstructured text into Sesame using a variety of lenses and a wide range of unstructured sources.
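A minimal sketch of loading an N-Quads file into a Sesame repository over its REST API might look like the following; the server URL, repository name, file name, and content type are assumptions and may differ across Sesame/RDF4J versions.

    # Minimal sketch: pushing an N-Quads file into a Sesame/RDF4J repository
    # over its HTTP API. URL, repository name, and file name are assumptions.
    import requests

    SESAME_STATEMENTS_URL = (
        "http://localhost:8080/openrdf-sesame/repositories/terrorism/statements"
    )

    with open("narratives.nq", "rb") as f:       # N-Quads produced by the extraction
        resp = requests.post(
            SESAME_STATEMENTS_URL,
            data=f,
            headers={"Content-Type": "text/x-nquads"},  # newer RDF4J also accepts application/n-quads
        )
    resp.raise_for_status()                      # a non-2xx status means the load failed

Because N-Quads carry a fourth, context element, each narrative can keep its own named graph inside the store, matching the "triple plus the context" loading described above.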

Figure 5: Screenshot of the Sesame Workbench with Ordenite-ingested triples

Data Mining

While the native output of Ordenite extraction is a graph data structure, Ordenite can also convert the graph into single or interrelated tables. This is useful for data mining cases where the desired product might be an Excel spreadsheet or even the population of a traditional relational database. Below is an example of extracting criminal activity along with the details of each crime. In this example, news stories were used as the source and Ordenite was used to mine the desired details.

Figure 6: Snippet of a crimes-committed table generated from news feeds

Facet Generation

Ordenite is easily integrated with Solr, the popular open source enterprise search platform. Ordenite has built-in features to populate Solr fields, which are used in faceted search. In addition to populating facets, Ordenite comes with an open source Solr visualization platform for user-friendly Solr search. Ordenite can populate Solr fields from entities, entity attributes, activities, and activity attributes, and it can be configured to extract locations and times as well as text. Below is an example of an Ordenite and Solr integration using records describing terrorist events. The facets generated by Ordenite in this example are activity, location, date of incident, actor, victim, weapon, relief organization, and terrorist group.
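As a rough illustration of the Data Mining use described above, graph output could be flattened into a table by running a SPARQL SELECT over it and writing the rows to CSV; the query vocabulary and file names below are invented for illustration.

    # Rough sketch: flatten graph output into a CSV table by running a
    # SPARQL SELECT over it. Vocabulary and file names are invented.
    import csv
    from rdflib import Graph

    g = Graph()
    g.parse("crimes.ttl", format="turtle")       # graph produced by extraction

    rows = g.query("""
        PREFIX ex: <http://example.org/lens/>
        SELECT ?crime ?actor ?location ?date WHERE {
            ?crime a ex:CriminalActivity ;
                   ex:actor ?actor ;
                   ex:location ?location ;
                   ex:date ?date .
        }
    """)

    with open("crimes.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["crime", "actor", "location", "date"])
        writer.writerows(rows)                   # each result row becomes a table row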
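For the Solr integration, a sketch of indexing documents whose fields were populated from extracted values and then requesting facet counts might look like this; the core name, field names, and sample values are assumptions rather than Ordenite's actual field mapping.

    # Rough sketch of a Solr integration: index documents whose fields were
    # populated from extracted entities/activities, then request facet counts.
    # Core name, field names, and values are assumptions for illustration.
    import requests

    SOLR = "http://localhost:8983/solr/terrorism"

    docs = [{
        "id": "narrative-001",
        "activity": "Bombing",
        "location": "Example City",
        "actor": "Unknown",
        "weapon": "dynamite",
    }]

    # Index the documents (Solr accepts a JSON array of docs at /update).
    requests.post(f"{SOLR}/update?commit=true", json=docs).raise_for_status()

    # Ask Solr for facet counts over the extracted fields.
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": ["activity", "location", "weapon"],
    }
    facets = requests.get(f"{SOLR}/select", params=params).json()
    print(facets["facet_counts"]["facet_fields"]["activity"])

Solr returns facet counts as value/count pairs per field, which is what drives the faceted navigation shown in Figure 7.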

Figure 7: Screenshot of the Solr interface with Ordenite-generated fields for facets

Data for Machine Analysis

Ordenite has extracted unstructured data for uses ranging from data science to dashboards. Once text is structured in a way that machines easily understand, it becomes trivial to use formerly unstructured data in commercial and open source products and libraries that normally cannot work with free text. Ordenite has enabled text to be used in several proprietary and open source products and libraries in this way.

Figure 8: Screenshots of visualizations using Ordenite-extracted data