User Configurable Semantic Natural Language Processing

Jason Hedges
CEO and Founder, Edgetide LLC
info@edgetide.com
(443) 616-4941
Table of Contents

Bridging the Gap between Human and Machine Language
    Human and Machine Languages
    Ordenite: The Missing Link Between Human and Machine Language
    What is Semantic Natural Language Processing?
How Ordenite Works
    User Configuration
    Lenses
    Entities and the Entity Ontology
    Activities and the Activity Ontology
    Building the Graph Data Structure
Ordenite Implementations
    Question Answer System
    Populating a Triple Store
    Data Mining
    Facet Generation
    Data for Machine Analysis
Bridging the Gap between Human and Machine Language

It is estimated that unstructured information accounts for 70-90% of the data within most organizations. As computer systems advance, so too does the amount of unstructured data within the digital world. Although unstructured text makes up the overwhelming majority of enterprise data, few tools allow a computer system to gain a deep understanding of what the text describes.

Human and Machine Languages

Human languages describe entities and activities and their relationships to each other. Whether someone is describing a complex scientific reaction between particles or the latest blockbuster movies, they are describing entities and activities: things, and things that are happening. This is how humans experience the world, through objects (including intangible ones) and events.

Machine languages describe logic, processes, and algorithms. Computer systems excel with structured data, which they can easily use within programs, feed to statistical models, search, and display to a user in a variety of formats. However, much of the data that humans create is unstructured. This creates a gap between the majority of data and the type of data a computer system excels with.

Ordenite: The Missing Link Between Human and Machine Language

Edgetide spent several years researching and prototyping tools that would allow a system to obtain a deeper understanding of unstructured text. We determined that, because all human languages describe entities and activities, bridging the gap between human and machine requires a highly configurable, topic-based entity and activity extraction system. The system would also need to understand the relationships between entities and activities. Finally, the system would need to convert the unstructured text to a data structure that a computer system could easily understand, without losing any meaning from the text in the conversion.
What is Semantic Natural Language Processing?

Semantic Natural Language Processing (NLP) is the ability to capture the meaning of unstructured text in a way that a computer system can understand and fully take advantage of. Ordenite offers a highly configurable Semantic NLP extraction platform that orders and unites unstructured data by determining the semantic meaning of text and building linked-node, graph-based data structures from the content. These data structures enable computer systems to query and analyze unstructured content. Ordenite goes beyond typical syntactic comparison of words to interpret the meaning of statements. Ordenite's design allows users to extract objects and graphs by configuring or extending user-defined lenses without software code or statistical training, thus providing a machine-friendly format that captures the meaning of the text within the perspective of the configured topic area.

How Ordenite Works

Ordenite's patent-pending methods and algorithms empower an organization to unlock its unstructured content for machine evaluation, search, and analysis. Ordenite ships with certain lenses (configurations for specific topics) for extraction, but a great advantage for the customer is the ability to modify existing lenses or create new ones. In this section we'll briefly walk through how Ordenite's configuration works and how it relates to text extraction.

User Configuration

One of Ordenite's most versatile features is the ability to create new configurations based on an area of interest. We call these lenses because they offer a different view of the data, specific to the scope of the desired topic area. Each lens consists of an ontology for entities and another for activities. Ordenite's graphical web interface (shown in Figure 1) makes it easy to create new lenses, modify existing ones, and edit their entity and activity ontologies. Users can also attach rules and operations to each entity or activity directly in the web interface.

Figure 1: Entity Ontology with the Human entity highlighted
Lenses

Ordenite divides extraction configurations by user-configurable topics of interest, called lenses. A lens is a configuration for a specific subject of interest. User-defined lenses allow Ordenite to provide multiple perspectives on a given corpus of input. This means that different users can interpret the same content according to their specific interests. Lenses can cover topics as different as sports, terrorism, and finance, and can be as specific or generic as needed. A lens comprises an entity ontology and an activity ontology. The two ontologies represent the things, and the things that happen, within a subject of interest. Users can create multiple lenses and even derive new lenses from existing ones.

Entities and the Entity Ontology

An entity is a distinct and independent thing. As mentioned above, Ordenite's purpose is to be a highly configurable system that recognizes activities and entities and how they are related to each other. Many NLP products and tools include entities in some form. Ordenite takes the idea of entities much further than most others because it allows you to relate entities to their attributes and also to associated activities. Making it even more practical for custom use, Ordenite gives the user control of the entity configuration for each topic area through lenses, described above.

Entities can contain a set of attributes and can also inherit the attributes of their parents. An attribute value can be a portion of the entity's text or another entity altogether. For example, consider the sentence "The red car was parked at the store." The entity Vehicle could have the attribute color, which in this sentence would be red. The ability to attach attributes to entities is important so that the meaning of the text is maintained when it is translated to a graph data structure.

There are many advantages to using an ontology to configure entities for a lens.
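As a rough illustration of entity attributes and parent inheritance, the following Python sketch models a two-level entity ontology. It is not Ordenite's configuration format, which this paper does not specify. The class EntityType, the function extract_attributes, and the idea of defining an attribute by a small vocabulary of words are all assumptions made for the example; only the names Tangible, Vehicle, and color come from this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class EntityType:
    """A node in a hypothetical entity ontology (not Ordenite's real schema)."""
    name: str
    parent: Optional["EntityType"] = None
    # attribute name -> words that can fill it (a toy rule, assumed for this sketch)
    attributes: Dict[str, Set[str]] = field(default_factory=dict)

    def all_attributes(self) -> Dict[str, Set[str]]:
        """Collect attribute definitions from the root down; children override parents."""
        merged = dict(self.parent.all_attributes()) if self.parent else {}
        merged.update(self.attributes)
        return merged


def extract_attributes(entity_type: EntityType, modifiers: List[str]) -> Dict[str, str]:
    """For each attribute, keep the first modifier word found in its vocabulary."""
    found = {}
    for attr, vocabulary in entity_type.all_attributes().items():
        for word in modifiers:
            if word.lower() in vocabulary:
                found[attr] = word.lower()
                break
    return found


# Tangible defines color once; Vehicle defines nothing itself and inherits it.
tangible = EntityType("Tangible", attributes={"color": {"red", "blue", "green"}})
vehicle = EntityType("Vehicle", parent=tangible)

# Words modifying "car" in "The red car was parked at the store."
car_attributes = extract_attributes(vehicle, ["The", "red"])
print(car_attributes)
```

Any other entity type created with parent=tangible would pick up the same color definition without further configuration, which is the saving that inheritance is meant to provide.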
An ontology is a method of modeling knowledge around a domain. Specifying the relationships between entities is valuable for recognition and graph data construction. In addition, entities can inherit from parent entities within the ontology. Inheriting a parent's attributes can greatly reduce the configuration needed for an entity. For example, consider the previous example sentence, "The red car was parked at the store." If a parent entity named Tangible has a definition for the attribute color, the entity Vehicle can inherit that attribute definition because it is a child of Tangible. The real advantage is that any other entity that inherits from Tangible also inherits the definition.

Activities and the Activity Ontology

The section above describes how Ordenite extracts things. Activities are how Ordenite extracts and understands things that happen. Ordenite uses fully configurable activity ontologies in which rules can be attached to each activity. The activity ontology follows a hierarchical structure that allows children to inherit rules and attributes from parent objects, greatly reducing the amount of configuration for subtypes. Ordenite uses a combination of lexical items and attribute rules to determine the semantic meaning of a statement. To better explain how it works, consider the following three simple example sentences:

1) Sally Smith made Joy walk to the park.
2) Sally Smith made Joy some cookies.
3) Sally Smith made Joy happy.

Each of the three sentences has the same lexical item, the verb "to make". However, each sentence has a very different meaning. Humans can determine the difference in meaning from context. Ordenite can do the same. The activity Force Person is identified in the first sentence. Its configuration specifies four rules that must be met:

- "to make" or "to force" as the lexical item
- a Human entity in the subject position of the statement, as the Actor
- a Human entity in the object position of the statement, as the Affected
- a verb phrase with the Affected as its subject, as the Action attribute

Ordenite can determine the position of a word in relation to the lexical item. A word or entity in the subject position is what performs the lexical item. In the example, Sally Smith is the entity performing "to make". The object position holds the word or entity that the lexical item affects. In the example, Joy is the entity affected by the lexical item. Ordenite can determine the correct position regardless of the many ways a statement can be constructed.
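The four Force Person rules can be sketched as a simple check over a statement that has already been parsed. This is only an illustration under stated assumptions: the Statement class, its field names, and the function match_force_person are hypothetical, and the parsing step that fills in subject, object, and complement (the part Ordenite actually performs) is taken as given rather than implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Statement:
    """A pre-parsed statement; a real system would derive these fields from text."""
    lemma: str            # base form of the lexical item, e.g. "make"
    subject: str          # text in the subject position
    subject_type: str     # entity type of the subject, e.g. "Human"
    obj: str              # text in the object position
    object_type: str      # entity type of the object
    complement_kind: str  # "verb_phrase", "noun_phrase", or "adjective"
    complement: str       # what follows the object


def match_force_person(s: Statement) -> Optional[Dict[str, str]]:
    """Return the activity and its attributes if all four rules hold, else None."""
    rules = [
        s.lemma in {"make", "force"},        # rule 1: lexical item
        s.subject_type == "Human",           # rule 2: Human subject as the Actor
        s.object_type == "Human",            # rule 3: Human object as the Affected
        # rule 4: assumes the parser attached this verb phrase to the object
        s.complement_kind == "verb_phrase",
    ]
    if not all(rules):
        return None
    return {
        "activity": "Force Person",
        "Actor": s.subject,
        "Affected": s.obj,
        "Action": s.complement,
    }


sentences = [
    Statement("make", "Sally Smith", "Human", "Joy", "Human", "verb_phrase", "walk to the park"),
    Statement("make", "Sally Smith", "Human", "Joy", "Human", "noun_phrase", "some cookies"),
    Statement("make", "Sally Smith", "Human", "Joy", "Human", "adjective", "happy"),
]
matches = [match_force_person(s) for s in sentences]
print(matches)
```

Only the first sentence satisfies all four rules. The other two share the lexical item, subject, and object but fail the fourth rule, so in this sketch they would be left for other activities in the ontology to claim.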
Figure 2: Ordenite graph data output of "Sally Smith made Joy walk to the park"

Building the Graph Data Structure

The building blocks of a Semantic Web graph are triples, each of which consists of a subject, a predicate, and an object. To build a graph data structure, the entities and activities are first extracted from the text. Once entities are populated from the rules in the lens's entity ontology, triples are constructed from their attributes. Similarly, once activities have been extracted from the text, they are converted into triples: the name of the activity is the subject, the name of the attribute is the predicate, and the value of the attribute is the object. The value of an activity attribute is most often an entity, which is what relates activities and entities to one another. When the triples are merged, they form a group of interconnected nodes, or a graph data structure. Ordenite generates this graph data structure automatically from unstructured text based on the lens configuration. The graph can be output in open standard formats such as RDF or N-Quads, or as JSON for easier integration into certain software.
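The conversion described above (name as subject, attribute name as predicate, attribute value as object) can be sketched in a few lines of Python. The dictionaries below are hypothetical stand-ins for extraction results from the Force Person example; the "type" attribute and the function to_triples are assumptions for illustration, and a production system would likely give each extracted activity a unique identifier instead of using its bare name.

```python
from collections import defaultdict

# Hypothetical extraction results for "Sally Smith made Joy walk to the park."
entities = {
    "Sally Smith": {"type": "Human"},
    "Joy": {"type": "Human"},
}
activities = {
    "Force Person": {
        "Actor": "Sally Smith",
        "Affected": "Joy",
        "Action": "walk to the park",
    },
}


def to_triples(records):
    """Turn {name: {attribute: value}} into (subject, predicate, object) triples."""
    return [
        (name, attr, value)
        for name, attrs in records.items()
        for attr, value in attrs.items()
    ]


triples = to_triples(entities) + to_triples(activities)

# Merging: index triples by subject so that shared node names connect the pieces.
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))

# Follow an edge from the activity to an entity, then read that entity's attributes.
actor = dict(graph["Force Person"])["Actor"]
print(actor, graph[actor])
```

Because the value of the Actor attribute is itself the subject of other triples, the activity node and the entity node end up linked, which is what turns a flat list of triples into a graph.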
Figure 3: Graph data visualization for multiple terrorism narratives

Ordenite Implementations

Ordenite has been, and continues to be, used in a wide variety of implementations. Ordenite is based on open standards so that it can be integrated quickly into existing enterprise architectures with minimal effort. Some interesting uses of Ordenite are highlighted below.

Question Answer System

Ordenite was used to create a system that allows a user to type a what, where, or when question and receive an answer along with a snippet of the original text for reference. As shown in the sections above, Ordenite can convert unstructured text into a graph data structure based on a lens configuration. Ordenite can also convert a human language question into a graph data query for a specific topic area (lens). To conform with open standards, Ordenite uses SPARQL as the query syntax. The ability to convert a question to SPARQL empowers users to perform complex graph queries without needing to know the syntax or even the data store ontology. While SPARQL is the only query syntax supported directly, an API for question conversion exposes the extracted query parameters so they can be used with other query syntaxes.
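To make the two-step idea concrete (extract query parameters from a question, then render them in a query syntax), here is a deliberately narrow Python sketch. It does not reflect how Ordenite converts questions, which this paper does not describe. The regular expression handles a single question shape, and the ex: namespace and the predicate names (activity, instrument, affected, sourceText, category) are invented for the example.

```python
import re

# Hypothetical namespace and predicate names; a real lens would supply its own.
PREFIX = "PREFIX ex: <http://example.org/lens/>"

# Matches one narrow question shape: "What <target> were <activity> from <instrument>"
QUESTION_PATTERN = re.compile(
    r"^what (?P<target>\w+) were (?P<activity>\w+) (?:from|by|with) (?P<instrument>\w+)\??$",
    re.IGNORECASE,
)


def question_to_params(question: str) -> dict:
    """Extract query parameters from the question, independent of query syntax."""
    match = QUESTION_PATTERN.match(question.strip())
    if match is None:
        raise ValueError(f"unsupported question: {question!r}")
    return {key: value.lower() for key, value in match.groupdict().items()}


def params_to_sparql(params: dict) -> str:
    """Render the parameters as a SPARQL SELECT query."""
    return "\n".join([
        PREFIX,
        "SELECT ?answer ?snippet WHERE {",
        f'  ?event ex:activity "{params["activity"]}" ;',
        f'         ex:instrument "{params["instrument"]}" ;',
        "         ex:affected ?answer ;",
        "         ex:sourceText ?snippet .",
        f'  ?answer ex:category "{params["target"]}" .',
        "}",
    ])


params = question_to_params("What buildings were damaged from dynamite")
print(params_to_sparql(params))
```

Keeping parameter extraction separate from rendering mirrors the point made above: the same parameters could be passed to a different renderer for a query syntax other than SPARQL.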
Figure 4: Screenshot of the question answer system with the question "What buildings were damaged from dynamite"

Populating a Triple Store

As mentioned throughout this document, Ordenite outputs a graph data structure based on a user-defined lens configuration. The graph data structure is output as JSON, RDF, or N-Quads. RDF and N-Quads can be inserted directly into most triple stores. Figure 5 shows the open source triple store Sesame with data ingested from Ordenite. Ordenite was used to create graph data structures from tens of thousands of narratives describing terrorism events. The graph was output as N-Quads so that each triple, plus its context, could be inserted into Sesame. Ordenite has been used to ingest unstructured text into Sesame using a variety of lenses and a wide range of unstructured sources.
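An N-Quads line is a triple followed by a fourth term, the graph label, which is how the context mentioned above travels with each triple. The Python sketch below formats such lines by hand. The base namespace, the context identifier narrative/1, and the helper names are assumptions for illustration, and the IRI helper assumes names contain only letters, digits, spaces, and slashes.

```python
BASE = "http://example.org/ordenite/"  # hypothetical namespace, not from the source


def iri(local_name: str) -> str:
    """Build an IRI term from a simple local name, replacing spaces with underscores."""
    return f"<{BASE}{local_name.replace(' ', '_')}>"


def literal(text: str) -> str:
    """Build a plain literal, escaping backslash, quote, and line-break characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_nquad(subject: str, predicate: str, obj: str, context: str, obj_is_entity: bool) -> str:
    """Format one statement as: subject predicate object graph-label ."""
    obj_term = iri(obj) if obj_is_entity else literal(obj)
    return f"{iri(subject)} {iri(predicate)} {obj_term} {iri(context)} ."


# The context is a hypothetical identifier for the narrative the triple came from.
lines = [
    to_nquad("Force Person", "Actor", "Sally Smith", "narrative/1", True),
    to_nquad("Force Person", "Action", "walk to the park", "narrative/1", False),
]
print("\n".join(lines))
```

With the source narrative as the graph label, a store that supports named graphs can answer not only what was extracted but also which document each statement came from.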
Figure 5: Screenshot of the Sesame Workbench with Ordenite-ingested triples

Data Mining

While the actual output of Ordenite extraction is a graph data structure, Ordenite can convert the graph to a single table or a set of interrelated tables. This is useful for data mining cases where the desired product might be an Excel spreadsheet or the population of a traditional relational database. Figure 6 shows an example of extracting criminal activity along with the details of each crime. In this example, news stories were used as the source and Ordenite was used to mine the desired details.

Figure 6: Snippet of a crimes-committed table generated from news feeds

Facet Generation

Ordenite is easily integrated with Solr, the popular open source enterprise search platform. Ordenite has built-in features to populate Solr fields, which are used in faceted search. In addition to populating facets, Ordenite comes with an open source Solr visualization platform for user-friendly Solr search. Ordenite can populate Solr fields from entities, entity attributes, activities, and activity attributes. Ordenite can be configured to extract locations and times as well as text. Figure 7 shows an example of an Ordenite and Solr integration using records describing terrorist events. The facets generated by Ordenite in this example are activity, location, date of incident, actor, victim, weapon, relief organization, and terrorist group.

www.edgetide.com
Edgetide LLC 2015 All Rights Reserved
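The field population described in the Facet Generation section can be pictured as flattening an extraction result into one document with multi-valued fields. The Python sketch below is an assumption-heavy illustration, not Ordenite's built-in Solr feature: the shape of the extraction dictionary, the sample values, and the function to_solr_doc are invented, and the _ss suffix assumes a Solr schema that maps that suffix to multi-valued string fields through a dynamic field rule.

```python
import json

# Hypothetical extraction result for one record; all values are illustrative only.
extraction = {
    "id": "record-1",
    "activities": [
        {"name": "Bombing", "attributes": {"weapon": "dynamite", "location": "city hall"}},
    ],
    "entities": [
        {"type": "Building", "text": "city hall", "attributes": {}},
    ],
}


def to_solr_doc(result: dict) -> dict:
    """Flatten activities, entities, and their attributes into multi-valued fields."""
    doc = {"id": result["id"]}

    def add(field_name: str, value: str) -> None:
        # Assumed naming convention: "<name>_ss" is a multi-valued string field.
        values = doc.setdefault(field_name + "_ss", [])
        if value not in values:
            values.append(value)

    for activity in result["activities"]:
        add("activity", activity["name"])
        for attr, value in activity["attributes"].items():
            add(attr, value)
    for entity in result["entities"]:
        add(entity["type"].lower(), entity["text"])
        for attr, value in entity["attributes"].items():
            add(attr, value)
    return doc


solr_doc = to_solr_doc(extraction)
print(json.dumps([solr_doc], indent=2))
```

A JSON array of such documents is the general shape that Solr's JSON update handler accepts, after which fields like activity_ss or weapon_ss could be requested as facets in a search.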
Figure 7: Screenshot of the Solr interface with Ordenite-generated fields for facets

Data for Machine Analysis

Ordenite has extracted unstructured data for uses ranging from data science to dashboards. Once text can be structured in a way that machines easily understand, it becomes straightforward to use it in commercial and open source products and libraries that normally could not work with text. Ordenite has enabled text to be used in several proprietary and open source products and libraries.

Figure 8: Screenshots of visualizations using Ordenite-extracted data