Enriching knowledge graphs with text processing techniques

1 Enriching knowledge graphs with text processing techniques
ERCIM News 111
J.-M. Le Goff, CERN
A. Rattinger, CERN & Graz University of Technology
I-Know, Graz, Austria, 12 October 2017

2 Agenda
- Our approach: building the knowledge graph and the graph data model
- Identifying concepts and relationships in full text
- Enriching the knowledge graph
- Text mining for graph extensions (preliminary)
- Use case: patents and publications analytics
iknow2017

3 Heterogeneous data sources
[Diagram labels: Text; Data model]

4 Why build a knowledge graph out of datasets?
- Datasets contain valuable information for visual analytics
  - Businesses, applications, domains, etc.
- Datasets are difficult to use directly for visual analytics
  - They contain complex structures with various data types
  - They come in different data representations: structured, semi-structured and unstructured
  - Only subsets are of interest for a particular analysis
  - Domain-specific information is needed to understand the analytics output
- Build a data network from the elements of interest in the datasets
  - Vertices: data instances with labelled data types
  - Relationships: interconnectivity

5 Data network stored as a graph
- Graphs are natural representations of large and interconnected networks
  - Complexity
  - Interconnectivity
  - Scalability
  - Multi-dimensionality
- The data model is embedded in the graph itself
  - Node and relationship labels
  - Compact graph structure
  - Graph query language
  - No need for schema evolution
- Graphs of connected elements constitute multi-dimensional networks
  - Data model: labels and relationships
  - Labels: graph dimensions
  - Relationships: interconnectivity between labels
- Data network = knowledge graph

6 Knowledge graph and data model
- Knowledge graph: graph of data instances and relationships
- Graph data model: schema
- The data model is embedded in the knowledge graph
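The idea that the data model is embedded in the graph can be illustrated with a minimal sketch: if every vertex carries a label, the schema (a graph of labels) can be read off the instance graph. The `Vertex` and `Edge` structures, the `derive_schema` helper and the sample data below are hypothetical and only serve the illustration; they are not taken from the slides.

```python
from collections import namedtuple

# Hypothetical minimal structures for a labelled graph (not from the slides).
Vertex = namedtuple("Vertex", ["id", "label"])
Edge = namedtuple("Edge", ["source", "relation", "target"])


def derive_schema(vertices, edges):
    """Return the graph of labels: (source label, relation, target label) triples."""
    label_of = {v.id: v.label for v in vertices}
    return {(label_of[e.source], e.relation, label_of[e.target]) for e in edges}


# Made-up instance data: two publications sharing a keyword.
vertices = [
    Vertex("p1", "Pub"),
    Vertex("p2", "Pub"),
    Vertex("k1", "Keyword"),
    Vertex("a1", "OrgAddress"),
]
edges = [
    Edge("p1", "has", "k1"),
    Edge("p2", "has", "k1"),
    Edge("p1", "has", "a1"),
]

schema = derive_schema(vertices, edges)
```

Because the schema is computed from the instances, adding a vertex with a new label extends the data model without a separate schema migration step.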

7 [Diagram: Data source 1 to Data source n; Processing (populating, organising, labelling); Knowledge graph; Visual analytics]
Visual analytics is performed on the network using its schema

8 Labelling
Labelling vertices
- Semi-structured data: metadata, structural information; tags -> labels
- Structured data: relational databases; tables, fields -> labels
- Text processing to create new labels and new vertices
Labelling relationships
- Semi-structured data: relationships from nested tags (Has, ispartof, etc.)
- Structured data: relational databases; no labels in E-R models
- Vertex labels + text information to label relationships
  - Ex: IsA, send, receive, live, own, etc.

9 Ex: Publications/Patents metadata
Published items
- Publications (Scopus, WoK, etc.)
  - Organisation address (in data)
  - Keyword (in data)
  - Category (in data): journal category
- Patents (PatStat, etc.)
  - Organisation address (in data)
  - Category (in data): patent class

10 Data model from metadata tags
[Diagram: graph of labels built from document metadata, with nodes Cat: Scat, PubItem: Pub, KW, Org: Address, Cat: PatClas, PubItem: Pat]
Legend: Kw: Keyword; PubItem: Published Item; OrgAdd: Organisation Address
Data model: graph of labels

11 Publications/Patents metadata
Published items
- Publications
  - Organisation address: text processing yields
    - Organisation (from other data sources: Company, Institute)
    - City
    - Country
  - Keyword (in data)
  - Category (in data): journal category
- Patents
  - Organisation address: text processing yields
    - Organisation (from other data sources: Company, Institute)
    - City
    - Country
  - Category (in data): patent class
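A minimal sketch of how text processing might turn an organisation address into new vertices and relationships. It assumes, purely for illustration, a comma-separated address whose first part is the organisation and whose last two parts are city and country; real address strings are messier. The helper names, the sample address and the direction of the `isa` and `islocated` relationships are assumptions, not the authors' implementation.

```python
def split_address(address):
    """Split 'Organisation, ..., City, Country' into its parts.

    Assumption: comma-separated, organisation first, city and country last.
    Returns None when the address does not have at least three parts.
    """
    parts = [p.strip() for p in address.split(",") if p.strip()]
    if len(parts) < 3:
        return None
    return {"organisation": parts[0], "city": parts[-2], "country": parts[-1]}


def address_to_edges(address):
    """Derive (source, relation, target) triples from one address string."""
    fields = split_address(address)
    if fields is None:
        return []
    return [
        (address, "isa", fields["organisation"]),       # address names an organisation
        (address, "islocated", fields["city"]),         # address lies in a city
        (fields["city"], "islocated", fields["country"]),  # city lies in a country
    ]


# Illustrative input only.
edges = address_to_edges("Graz University of Technology, Graz, Austria")
```

Matching the extracted organisation name against other data sources (to label it as Company or Institute) would be a separate step that this sketch does not cover.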

12 Exploiting text information
[Diagram: graph of labels extended through text processing, with nodes Cat: Scat, PubItem: Pub: WoK, KW, Org: Comp, Org: Inst, Cny, Cty, Org Address, Cat: PatClas, PubItem: Pat, and relationships isa and islocated]
Document metadata
Data model: graph of labels
Legend: Kw: Keyword; Org: Organisation; Inst: Institute; Comp: Company; Cny: Country; Cty: City; OrgAdd: Organisation Address

13 Data sources:
- Patent & publications metadata
- Patent full text (USPTO)
Preliminary work

14 Publications/Patents analytics
Use case: Who are the key organisations active in a particular technology?
Motivations
- Technology monitoring
- How an emerging technology is evolving (research, industry)
- Foresight studies, looking for partners, joining collaborations
- Company and institution landscape

15 Publications/Patents analytics (2)
Use case: What is the organisation landscape of a technology?
Technique: search for publications/patents matching technology terms
- Titles and/or abstracts
Issues
- Quality of the technology terms used to identify a technology
- Search terms may not correspond to a single technology
- Some pertinent publications and patents may not contain the technology terms
Use text processing to address these issues

16 Add search output to knowledge graph
[Diagram: graph of labels with nodes Cat: Scat, PubItem: Pub: WoK, KW, Org: Comp, Org: Inst, Cny, Cty, Org Address, Cat: PatClas, PubItem: Pat, and an added Search node]
Search output: a subset of publications and patents matching technology terms
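Adding a search output to the graph could look like the sketch below: a new vertex labelled Search is linked to every matched publication or patent, after which the organisations behind a search can be found by following relationships. The dictionary-based graph, the `matches` relationship name and the sample identifiers are hypothetical.

```python
def add_search(graph, search_id, terms, matched_ids):
    """Add a Search vertex and link it to each matched item already in the graph."""
    graph["vertices"][search_id] = {"label": "Search", "terms": terms}
    for item_id in matched_ids:
        if item_id in graph["vertices"]:  # ignore hits that are not in the graph
            graph["edges"].append((search_id, "matches", item_id))
    return graph


def organisations_for(graph, search_id):
    """Organisations reachable from a search through its matched items."""
    matched = {t for s, r, t in graph["edges"] if s == search_id and r == "matches"}
    return sorted(
        t
        for s, _, t in graph["edges"]
        if s in matched and graph["vertices"][t]["label"] == "Org"
    )


# Made-up miniature graph.
graph = {
    "vertices": {
        "pat1": {"label": "Pat"},
        "pub1": {"label": "Pub"},
        "org1": {"label": "Org"},
    },
    "edges": [("pat1", "has", "org1")],
}
add_search(graph, "search1", ["Through Silicon Via"], ["pat1", "pub1", "missing"])
orgs = organisations_for(graph, "search1")
```

Storing the search as a vertex keeps the result set inside the same data model as the metadata, so it can be queried like any other label.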

17 Illustrating the approach
Through Silicon Via (Wikipedia)
Search terms on titles of publications/patents:
- Through Silicon Via: exact matching, high-quality output
- TSV: more output, but of lower quality

18 Technology: Through Silicon Via (TSV)

19 Keywords: Through Silicon Via (title)

20 Keywords: TSV (title)
TSV also means: Taura Syndrome Virus
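The difference between the two title searches can be sketched with plain regular expressions. The titles below are invented for illustration, and the matching rules (case-insensitive phrase, case-sensitive acronym, whole-word boundaries) are assumptions rather than the search engine settings used in the talk.

```python
import re


def title_matches(title, term, case_sensitive=False):
    """True if all words of `term` appear consecutively in `title` as whole words.

    Words may be separated by whitespace or hyphens.
    """
    pattern = r"\b" + r"[\s-]+".join(re.escape(w) for w in term.split()) + r"\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.search(pattern, title, flags) is not None


# Invented example titles.
titles = {
    "pat1": "Through silicon via with reduced stress",
    "pat2": "TSV interconnect for 3D integrated circuits",
    "pat3": "Through-silicon via (TSV) formation method",
    "pub1": "Detection of TSV in farmed shrimp",
}

phrase_hits = {i for i, t in titles.items() if title_matches(t, "Through Silicon Via")}
acronym_hits = {i for i, t in titles.items() if title_matches(t, "TSV", case_sensitive=True)}

# Items found only through the acronym: more output, but some refer to another meaning.
acronym_only = sorted(acronym_hits - phrase_hits)
```

In this toy set the acronym finds one extra relevant patent and one unrelated publication, which is the trade-off the slides describe.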

21 Methodology
Preliminary work

22 Approach
- Index patents
- Patent-specific preprocessing
- Create document embeddings / feature vectors (Doc2Vec)
- Dimensionality reduction / manifold learning (t-SNE, LargeVis)
- Visualization (Datashader, large-scale visualization)

23 Index patents / Preprocessing
Current dataset: USPTO patents from
Candidate generation: patents are indexed for fuzzy search (Lucene)
Preprocessing
- Clean HTML syntax
- Remove references, stopwords
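The preprocessing steps can be sketched with the Python standard library. The stopword list is a tiny illustrative sample, and the assumption that references appear as bracketed numbers such as `[12]` is made up for the example; the actual patent-specific rules are not given in the slides.

```python
import re
from html import unescape

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "for", "with"}

TAG_RE = re.compile(r"<[^>]+>")      # HTML tags
REF_RE = re.compile(r"\[\d+\]")      # assumed reference style: [12]
TOKEN_RE = re.compile(r"[a-z0-9]+")  # lowercase alphanumeric tokens


def preprocess(raw_html):
    """Strip HTML, drop bracketed references, tokenize, remove stopwords."""
    text = TAG_RE.sub(" ", raw_html)
    text = unescape(text)            # decode entities such as &amp;
    text = REF_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokens = preprocess("<p>A through-silicon via is formed in the substrate [12].</p>")
```

The resulting token lists are the kind of input a document embedding model expects.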

24 Document embedding
- Numeric representation of text documents
- Gensim implementation, based on word2vec CBOW
- Concept: Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents.

25 Dimensionality reduction
- Our dataset contains 2 million points with 300 dimensions
- Many techniques are costly (MDS, t-SNE)
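A back-of-the-envelope calculation shows why methods that need all pairwise distances become costly at this scale. The 8 bytes per value (64-bit floats) is an assumption; the point counts and dimensions are the ones quoted on the slide.

```python
n_points = 2_000_000   # from the slide
n_dims = 300           # from the slide
bytes_per_value = 8    # assumption: 64-bit floats

# Storing the embeddings themselves grows linearly with the number of points.
embedding_bytes = n_points * n_dims * bytes_per_value

# A full pairwise distance matrix grows quadratically with the number of points.
pairwise_bytes = n_points * n_points * bytes_per_value

embedding_gb = embedding_bytes / 1e9   # 4.8 GB
pairwise_tb = pairwise_bytes / 1e12    # 32 TB
```

Under these assumptions the embeddings fit in the memory of a single machine, while a dense distance matrix does not, which is one motivation for approximate neighbour-based methods such as LargeVis.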

26 Large-scale visualization
Datashader
- Rasterization pipeline
- Handles large amounts of data
Patent class labels
- A: Human necessities
- B: Performing operations; transporting
- C: Chemistry; Metallurgy
- D: Textiles; Paper
- E: Fixed Constructions
- F: Mechanical engineering; lighting;
- G: Physics
- H: Electricity
Colours correspond to distinct labels (2,022,349 patents)
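The principle behind a rasterization pipeline can be sketched in plain Python: instead of drawing every point, points are binned into a fixed pixel grid and each pixel keeps a count per label, from which a colour can later be chosen. This is a simplified illustration of the idea, not Datashader's API; the function names and sample points are hypothetical.

```python
from collections import Counter


def rasterize(points, width, height, x_range, y_range):
    """Bin (x, y, label) points into a height x width grid of per-label counts."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[Counter() for _ in range(width)] for _ in range(height)]
    for x, y, label in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the viewport
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col][label] += 1
    return grid


def dominant_label(cell):
    """Label with the highest count in a pixel, or None for an empty pixel."""
    return cell.most_common(1)[0][0] if cell else None


# Made-up 2D points labelled with patent sections.
points = [(0.1, 0.1, "G"), (0.15, 0.12, "G"), (0.2, 0.2, "H"), (0.9, 0.9, "H")]
grid = rasterize(points, width=2, height=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
```

The cost of the final image depends on the grid size rather than on the number of points, which is what makes this approach suitable for millions of documents.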

27 Full text processing
Patent full text: title + abstract + description (USPTO full-text patents)
Objectives
1: Look for pertinent patents that do not contain "Through Silicon Via"
2: Look for different meanings of TSV

28 International Patent Classification (IPC)
- Sections from A ("Human Necessities") to H ("Electricity")
- Classes (A01: "Agriculture; forestry; animal husbandry; trapping; fishing")
- Subclasses, group number, subgroup
- Example: H01L 23/00 (Details of semiconductor or other solid state devices)
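The hierarchy of an IPC symbol such as H01L 23/00 can be made explicit with a small parser. The regular expression below is a sketch that covers symbols of the form shown on the slide; it is not a complete validator for the IPC scheme, and the function name is hypothetical.

```python
import re

# Section letter, two-digit class, subclass letter, group number, subgroup number.
IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol):
    """Split an IPC symbol into its hierarchy levels, or return None if malformed."""
    match = IPC_RE.match(symbol.strip())
    if match is None:
        return None
    section, cls, subclass, group, subgroup = match.groups()
    return {
        "section": section,                    # e.g. H (Electricity)
        "class": section + cls,                # e.g. H01
        "subclass": section + cls + subclass,  # e.g. H01L
        "group": group,                        # e.g. 23
        "subgroup": subgroup,                  # e.g. 00
    }


parsed = parse_ipc("H01L 23/00")
```

Truncating the symbol at a chosen level (section, class or subclass) gives labels of different granularity for grouping or colouring patents.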

29 Patent class G (Physics): overview
378,692 patents

30 Patent search (TSV or Through Silicon Via)
Most prominent patent classes (1696 results)
Document embedding

31 Comparison of Through Silicon Via vs TSV
Relevant data differs for relevant search results
- Through Silicon Via (954 results)
- TSV (982 results)

32 Through Silicon Via vs TSV
Relevant documents to the search for different terms
- Through Silicon Via (954 results)
- TSV (982 results)

33 Derive new relationships
[Diagram: graph of labels with nodes Cat: Scat, PubItem: Pub: WoK, KW, Org: Comp, Org: Inst, Cny, Cty, Org Address, Search, Cat: PatClas, PubItem: Pat]

34 Search extended to publications
- Publications: titles + abstracts
- Publications offer a different viewpoint
- Publications are classified according to the n closest patent classes
  - G06F: Electric Digital Data Processing (Through Silicon Via (Antenna))
  - A61K: Medical or Veterinary Science (Taura Syndrome Virus, tachycardia beat, Tellerspülvermögens)
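One possible reading of "classified according to the n closest patent classes" is a nearest-neighbour vote in the embedding space: a publication receives the class that is most frequent among its n most similar patents. The sketch below uses cosine similarity on made-up two-dimensional vectors; the function names, the vectors and the voting rule are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(pub_vec, patents, n=3):
    """Majority patent class among the n patents most similar to the publication.

    `patents` is a list of (embedding vector, patent class) pairs.
    """
    ranked = sorted(patents, key=lambda p: cosine(pub_vec, p[0]), reverse=True)
    votes = Counter(cls for _, cls in ranked[:n])
    return votes.most_common(1)[0][0]


# Made-up low-dimensional embeddings; real ones would have 300 dimensions.
patents = [
    ([1.0, 0.1], "H01L"),
    ([0.9, 0.2], "H01L"),
    ([0.1, 1.0], "A61K"),
    ([0.2, 0.9], "A61K"),
]
label = classify([0.8, 0.3], patents, n=3)
```

This brute-force ranking is quadratic over the whole collection; at the scale of millions of patents an approximate nearest-neighbour index would be needed.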

35 Conclusion
- Networks populated with metadata need to be enriched to properly support visual analytics
- Enrichment can come from additional data sources or from text processing on the documents referenced in the metadata
- Preliminary text processing results on patents (title, abstract and description) indicate that it is possible to:
  - Enrich a patent set w.r.t. a search result
  - Regroup patents via patent categories, showing different meanings of search terms
  - Link some of the publications with nearby patents

36 Thank you for your attention!
