A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data

Size: px

Start display at page:

Download "A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data"

Abel Hensley
5 years ago
Views:

1 A Scalable Architecture for Extracting, Aligning, Linking, and Visualizing Multi-Int Data Craig Knoblock & Pedro Szekely University of Southern California

2 Introduction Massive quantities of data available for analysis OSING, HUMINT, SIGINT, MASINT, GEOINT, Data is spread across multiple sources, multiple sites and multiple formats Databases, text, web sites, XML, JSON, etc... If an analyst could exploit all of this data, it could transform analysis Disruptive technology for analysis University of Southern California 2

Solution: Domain-specific Insight Graphs Innovative architecture Extracting, aligning, linking, and visualizing massive amounts of data Domain-specific content from structured and

3 Solution: Domain-specific Insight Graphs Innovative architecture Extracting, aligning, linking, and visualizing massive amounts of data Domain-specific content from structured and unstructured sources State-of-the art open source software Open architecture with flexible APIs Cloud-based infrastructure (HDFS, Hadoop, ElasticSearch, etc.) University of Southern California 3

4 Example Scenario Want to determine the nuclear know-how of a given country from open source data Analyze the universities, academics, publications, reports, articles within the country University of Southern California 4

5 Exploit the data available from Web pages, publications, articles, etc. Produce a knowledge graph Key people and connections Technical capabilities and how they have changed over time Scenario Results University of Southern California 5

6 DIG Pipeline Crawling Extracting Cleaning Integration Computing simlarity Entity resolution Graph construction Query, analysis, and visualization University of Southern California 6

7 Crawling Challenge: how to crawl just the relevant pages Approach: Uses the Apache Nutch framework for Web pages Uses Karma to extract pages from the deep Web University of Southern California 7

8 Extracting Need to produce a structured representation for indexing and linking Highly configurable architecture for extractors Learning of landmark extractors for structured data Trainable CRF-based extractors for unstructured data Uses Mechanical Turk to crowd source training data University of Southern California 8

9 Cleaning Cleaning and normalization to support analysis and linking Visualization showing data distribution Learned transformations from examples Cleaning programs written in Python University of Southern California 9

10 Integration Need to align the data across extracted data and structured sources Performed using a data integration tool called Karma Karma maps arbitrary sources into a shared domain vocabulary (schema alignment) Uses machine learning to minimize user effort University of Southern California 10

11 Integration Using Karma University of Southern California 11

12 Similarity Computes similarity across text fields and images Image similarity done using DeepSentiBank Text similarity done using Minhash/LSH University of Southern California 12

13 Entity Resolution Finds matching entities Reference source Match against source to disambiguate entities E.g., geonames for locations No reference source Combine entities by considering the similarity across multiple fields University of Southern California 13

14 Graph Construction Data is integrated into a graph that can be queries and analyzed Data stored in HDFS Data represented in a common language JSON- LD Represented using a common terminology University of Southern California 14

15 Query, Analysis and Visualization Challenge: support efficient querying against the graph Employ ElasticSearch to provide keyword querying, faceted browsing, and aggregation queries University of Southern California 15

16 Query, Analysis & Visualization Visualization interface that provides faceted queries, timeslines, maps, etc. University of Southern California 16

17 Discussion Technology that can provide dramatic new insights from data that is already available Applies to a wide range of problems Determining the nuclear know-how of a given country Technologies, key scientists, relevant organizations Combating human trafficking Understanding trends in technical areas E.g., Material Science Analyzing the competitive landscape of companies and many other domains with massive quantities of data University of Southern California 17

18 USC DIG Team University of Southern California 18

19 Acknowledgements Collaborators Next Century Technologies InferLink Inc. JPL Columbia University Sponsor DARPA AFRL contract number FA C-0240 University of Southern California 19

20 Thanks! More information: Homepage DIG isi.edu/~knoblock usc-isi-i2.github.io/dig Karma usc-isi-i2.github.io/karma University of Southern California 20

Building knowledge graphs in DIG. Pedro Szekely and Craig Knoblock University of Southern California Information Sciences Institute dig.isi.

Building knowledge graphs in DIG Pedro Szekely and Craig Knoblock University of Southern California Information Sciences Institute dig.isi.edu Goal raw messy disconnected clean organized linked hard to