Data-Mining Algorithms with Semantic Knowledge


Data-Mining Algorithms with Semantic Knowledge: Ontology-based information extraction
Carlos Vicient Monllaó, Universitat Rovira i Virgili
December 14th, Poznan
A project funded by the Ministerio de Ciencia e Innovación and Universitat Rovira i Virgili
DAMASK, 2010


Contents
1. DAMASK
   1. Introduction
   2. Goals
   3. Working plan
2. Ontology-based information extraction (Task 1)
   1. State of the art
      1. IR vs IE
      2. Ontology-based IE
   2. Main methodology
   3. Step by step methodology
      1. Named entity detection
      2. Discovering entity-subsumer concept (candidate extraction)
      3. Semantic annotation

1. DAMASK: DATA MINING ALGORITHMS WITH SEMANTIC KNOWLEDGE

INTRODUCTION
Data-Mining Algorithms with Semantic Knowledge
Funded by the Ministerio de Ciencia e Innovación and Universitat Rovira i Virgili
Main motivations:
- Explosive growth in the amount of information available on networked computers around the world, much of it in the form of natural language documents
- Increasing interest in Semantic Web content => semantic knowledge
- Traditional data mining methods make little use of domain knowledge


GOALS
- Processing and extraction of Web resources based on ontologies
  - Extraction of relevant data from a domain of structured, semi-structured and unstructured Web resources
  - Semantic integration of information in an attribute-value matrix that can be used by subsequent clustering methods
- Automatic classification of data (clustering method based on ontologies)
  - Adaptation of traditional clustering methods to create classifications (trees and partitions) using semantic information
  - Definition of methods to automatically analyse the clusters obtained in the previous step
- Testing the practical applicability of the developed methods in the strategic area of Tourism

WORK PLAN I
The project is divided into three main tasks:
- Task 1: Ontology-based information extraction and integration from heterogeneous Web resources
- Task 2: Automatic clustering of entities based on the semantics of the concepts and attributes obtained from the Web resources
- Task 3: Application of the developed methods to a Tourism test case


WORK PLAN (Task 1) II
The key point of this task is to complement syntactic parsing and natural language processing techniques with the knowledge contained in one or several input ontologies in order to:
- Identify relevant features describing a particular entity from textual data
- Associate, if applicable, the extracted features with concepts contained in the input ontologies

2. ONTOLOGY-BASED INFORMATION EXTRACTION

STATE OF THE ART (IR vs IE) I
- Information Retrieval (IR) simply finds texts and presents them to the user (as classic search engines do)
- Information Extraction (IE) is the task of locating specific pieces of data within a natural language document
- IE analyses texts and presents only the specific information extracted from the text that is of interest to a user
- Wrapper: a set of extraction rules suitable for extracting information from a Web site
- Two main approaches:
  - Knowledge engineering: supervised, traditional IE
  - Automatic training: unsupervised, open IE


STATE OF THE ART (IR vs IE) II
[Table: comparison of traditional IE and Open IE]

STATE OF THE ART (Ontology-Based IE) III
Ontology-based IE (motivations):
- Growing interest in the research community in developing data mining techniques
- Textual documents describing a particular entity are difficult to process when the goal is to extract relevant features that semantically focused data mining algorithms could exploit
- Many Semantic Web approaches assume that resources have already been semantically annotated, but in the short term we cannot expect a massive amount of annotated Web resources to be available
- Ontology-based information extraction relies on ontologies to interpret the textual content of a resource regardless of its format

STATE OF THE ART (Ontology-Based IE) IV
- Ontologies have emerged as a new paradigm to model and formalize domain knowledge in a machine-readable way
- IE and ontologies are involved in two main, related tasks:
  - Information extraction: IE needs ontologies as part of the understanding process for extracting the relevant information
  - Populating and enhancing the ontology: texts are useful sources of knowledge to design and enrich ontologies
- These two tasks can be combined in a cyclic process: ontologies are used to interpret the text at the right level for IE, and IE extracts new knowledge from the text to be integrated into the ontology

[Figure: cyclic process. IE ALGORITHMS -> relevant extracted information -> ONTOLOGY (populating and enhancing) -> IE ALGORITHMS]

METHODOLOGY I
The Task 1 methodology is comparable to the automatic semantic annotation of documents:
1. Named Entity detection (instances of things)
2. Discovering entity-subsumer concepts (candidates for each Named Entity)
3. Semantic annotation of Named Entities (pairs of NE and candidate)


METHODOLOGY (Named Entity detection) II
Named entities in the example: Madrid, Paris, Llobregat, Catalan, Antoni Gaudí, Sagrada Familia


METHODOLOGY (Named Entity detection) III
- Extracted NE: Madrid, Paris, Llobregat, Catalan, Antoni Gaudí, Sagrada Familia
- Representative NE: Catalan, Llobregat, Antoni Gaudí, Sagrada Familia


METHODOLOGY (discovering Entity-subset) IV
Representative NE and the subset of candidate concepts found for each:
- Sagrada Familia: {Cathedral, church}
- Catalan: {Language}
- Llobregat: {River, town}
- Antoni Gaudí: {Architect, person}

METHODOLOGY (Ontology Matching) V
NE-subset pairs passed to semantic annotation:
- Sagrada Familia; {Cathedral, church}
- Catalan; {Language}
- Llobregat; {River, town}
- Antoni Gaudí; {Architect, person}


STEP by STEP (Named Entity detection) I
"Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities." (CoNLL 2002)
Problems in detecting NE:
- They are unstructured and unlimited by nature
- Relationships remain hidden in the text from which the extraction has been performed
Approaches:
- Using rules learned from pre-tagged examples => recall problems
- Using a thesaurus to detect NE (a word that is not found in the dictionary is assumed to be a NE) => NE composed of common words are discarded
- Exploiting, by means of heuristics, the way in which NE are expressed in languages such as English => inaccurate results
- Using linguistic analysis, heuristics and Web statistics

STEP by STEP (Named Entity detection) II
- Linguistic analysis is applied to detect NE
- Tool: OpenNLP => natural language parser
- Four steps: SD, TOK, TAG, CHUNK
- CHUNK is able to detect NE using a database => lower recall, limited NE
  [NP The/VB gothic/jj cathedral/nn] [VP of/vb] [NP Barcelona/NNP]
  [NP Tarragona/EX] [VP is/nns] [NP a/jjs city/nn]
Proposal:
- Filter noise: remove stop words, misspellings, etc.
- Heuristics: select all Noun Phrases (NP) where [NP.+ Regex2: s[a-z]
- Problem: not all potential NE are representative of the analysed instance, e.g. neither Paris nor Madrid is representative of Barcelona
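The candidate-selection heuristic can be sketched roughly as follows. The exact regular expressions used in the slides are not recoverable, so this is only an illustration under stated assumptions: it parses bracketed chunker output of the form shown above, keeps noun phrases (NP), and treats capitalised words that are not stop words as named-entity candidates. The function name `noun_phrase_candidates` and the stop-word list are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word list used to filter noise.
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}

# Matches one bracketed chunk such as "[NP Barcelona/NNP]".
CHUNK_RE = re.compile(r"\[(\w+)\s+([^\]]+)\]")


def noun_phrase_candidates(chunked_text):
    """Return capitalised word sequences found inside NP chunks."""
    candidates = []
    for label, body in CHUNK_RE.findall(chunked_text):
        if label != "NP":
            continue  # only noun phrases can hold named entities here
        # Each token looks like "word/TAG"; keep the word part only.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        # Assumption: capitalisation (not the POS tag) signals a proper noun.
        proper = [w for w in words
                  if w[:1].isupper() and w.lower() not in STOP_WORDS]
        if proper:
            candidates.append(" ".join(proper))
    return candidates


sample = "[NP The/VB gothic/jj cathedral/nn] [VP of/vb] [NP Barcelona/NNP]"
print(noun_phrase_candidates(sample))
```

Capitalisation is used instead of the tags because the tags in the sample output are not reliable enough to filter on; a real pipeline would combine both signals.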

STEP by STEP (Named Entity detection) III
- To improve the precision of NE extraction, it is complemented with a Web-based reliability analysis
- The Web offers a wider context, i.e. several observations in heterogeneous contexts
- The Web-based analysis uses Web statistics (hits) to sort all NE by means of semantic relatedness measures
- Relatedness measures:
  - PMI (Pointwise Mutual Information)
  - SCP (Symmetric Conditional Probability)
  - NGD (Normalized Google Distance)

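STEP by STEP (Named Entity detection) IV
PMI (Pointwise Mutual Information):
$\mathrm{PMI}(a, b) = \log_2 \frac{p(a, b)}{p(a)\,p(b)}$
Using hits:
$\mathrm{Sim}(a, b) = \frac{\mathrm{hits}(a \wedge b)}{\mathrm{hits}(a) \cdot \mathrm{hits}(b)}$
The PMI value reported for Barcelona and Sagrada Familia in the results table is consistent with this hit ratio.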

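STEP by STEP (Named Entity detection) V
SCP (Symmetric Conditional Probability):
$\mathrm{SCP}(a, b) = \frac{p(a, b)^2}{p(a)\,p(b)}$
NGD (Normalized Google Distance):
$\mathrm{NGD}(a, b) = \frac{\max(\log \mathrm{hits}(a), \log \mathrm{hits}(b)) - \log \mathrm{hits}(a \wedge b)}{\log M - \min(\log \mathrm{hits}(a), \log \mathrm{hits}(b))}$
where M is the total number of Internet web pages.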
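As a rough sketch, the three measures can be computed from raw hit counts as below. The hit-ratio form of the PMI score is inferred from the Sagrada Familia row of the results table (295,000,000, 3,560,000 and 2,710,000 hits), and SCP and NGD follow their standard textbook definitions. The total page count `TOTAL_PAGES` is a purely hypothetical placeholder, since the slides do not give a value for M.

```python
import math

# Hypothetical value for M, the total number of indexed pages (assumption).
TOTAL_PAGES = 10_000_000_000


def hit_ratio(hits_a, hits_b, hits_ab):
    """PMI-style score as used in the slides: joint hits over the product."""
    return hits_ab / (hits_a * hits_b)


def scp(hits_a, hits_b, hits_ab):
    """Symmetric Conditional Probability; M cancels out of the ratio."""
    return hits_ab ** 2 / (hits_a * hits_b)


def ngd(hits_a, hits_b, hits_ab, total=TOTAL_PAGES):
    """Normalized Google Distance (lower means more related)."""
    log_a, log_b, log_ab = map(math.log, (hits_a, hits_b, hits_ab))
    return (max(log_a, log_b) - log_ab) / (math.log(total) - min(log_a, log_b))


# Barcelona vs Sagrada Familia, hit counts taken from the results table.
barcelona, sagrada, both = 295_000_000, 3_560_000, 2_710_000
print(hit_ratio(barcelona, sagrada, both))
print(scp(barcelona, sagrada, both))
print(ngd(barcelona, sagrada, both))
```

Note that the hit ratio and SCP grow with relatedness while NGD shrinks, so a ranking built on NGD has to be sorted in the opposite direction.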

STEP by STEP (Named Entity detection) VI
Hits for Barcelona and Sagrada Familia [hit counts not recoverable]
Sim(Barcelona, Sagrada Familia) = 3.42803E-09

STEP by STEP (Named Entity detection) VII

| Named Entity | Hits(Bcn) | Hits(NE) | Hits(Bcn^NE) | PMI |
| Sagrada Familia | 295,000,000 | 3,560,000 | 2,710,000 | 2.580461E-09 |
| Llobregat | 295,000,000 | 2,220,000 | 1,530,… | ….336235E-09 |
| Antoni Gaudí | … | … | … | ….605033E-09 |
| Madrid | … | … | … | ….592376E-09 |
| Paris | … | … | … | ….789873E-10 |
| Catalan | … | … | … | ….772626E-10 |

(*) Queries were performed using the Yahoo search engine.

STEP by STEP (Named Entity detection) VIII
Select representative NE using different thresholds
- Extracted NE sorted by PMI: Sagrada Familia, Llobregat, Antoni Gaudí, Madrid, Paris, Catalan
- Representative NE: the same sorted list, cut at the chosen threshold
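The threshold-based selection might look like the sketch below. The scores are made-up placeholders that only respect the ordering shown on the slide (they are not the PMI values from the table), and the thresholds and the function name `select_representative` are likewise hypothetical.

```python
# Placeholder relatedness scores (NOT the values from the slides); only
# the relative order Sagrada Familia > ... > Catalan follows the source.
TOY_SCORES = {
    "Madrid": 0.55,
    "Paris": 0.10,
    "Llobregat": 0.80,
    "Catalan": 0.09,
    "Antoni Gaudí": 0.60,
    "Sagrada Familia": 0.90,
}


def select_representative(scores, threshold):
    """Sort entities by score (descending) and keep those >= threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# A looser and a stricter cut over the same ranking.
print(select_representative(TOY_SCORES, 0.5))
print(select_representative(TOY_SCORES, 0.7))
```

Raising the threshold trades recall for precision: fewer entities survive, but the ones that do are more strongly related to the analysed instance.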


STEP by STEP (Discovering entity-subsumer concept) I
- A way is needed to go from the instance level to the conceptual level in an unsupervised, domain-independent manner
- NE and subsumer concepts are related by means of taxonomic relationships
- Approaches:
  - Document-based notion of term subsumption
  - Semantic similarity according to the shared context
  => Both require a considerable number of documents and a lot of linguistic parsing
  - Linguistic patterns => offer relatively high precision but suffer from low recall, because explicit linguistic patterns are rare in corpora

STEP by STEP (Discovering entity-subsumer concept) II
Solution: exploit the Web in order to enlarge the corpus
Hearst patterns: used to acquire hyponym/hypernym relations from unrestricted text
- NN such as NP (cities such as Tortosa)
- such NN as NP (such cities as Tarragona)
- NP or other NN (London or other cities)
- NP and other NN (Barcelona and other cities)
- NN including NP (locations including Reus)
- NN especially NP (gothic cathedrals especially Sagrada Familia)


STEP by STEP (Discovering entity-subsumer concept) III
- Six queries are constructed for each NE
- The snippets returned for each query are analysed using a natural language parser in order to extract the taxonomic relationships
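One way to build the six queries, one per Hearst pattern, is sketched below. The quoted-phrase syntax and the `*` wildcard are assumptions about the search engine's query language, and `build_hearst_queries` is a hypothetical helper; only the "such as London" form is confirmed by the slides.

```python
# One template per Hearst pattern; the concept (NN) slot is left open so
# that the search engine's snippets reveal it. Query syntax is assumed.
HEARST_QUERY_TEMPLATES = [
    '"such as {ne}"',      # NN such as NP
    '"such * as {ne}"',    # such NN as NP (wildcard stands for NN)
    '"{ne} or other"',     # NP or other NN
    '"{ne} and other"',    # NP and other NN
    '"including {ne}"',    # NN including NP
    '"especially {ne}"',   # NN especially NP
]


def build_hearst_queries(named_entity):
    """Return the six pattern-based queries for one named entity."""
    return [t.format(ne=named_entity) for t in HEARST_QUERY_TEMPLATES]


for query in build_hearst_queries("London"):
    print(query)
```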

STEP by STEP (Discovering entity-subsumer concept) IV
Interpretation of snippets (query: "such as London")
- a big city such as London
  [NP a/vbz big/jj city/nn] [PP such/pdt] [NP as/nns London/NNP]
- travel topics such as London sightseeing
  [NP travel/nns topics/nns] [NP such/pdt] [NP as/nns London/NNP museum/nn]
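A much simplified stand-in for the snippet interpretation step is shown below. It uses a regular expression instead of a natural language parser, and it assumes that the candidate concept is made of the (at most three) words immediately preceding "such as", minus determiners. The names `subsumer_candidate` and `DETERMINERS` are hypothetical.

```python
import re

# Hypothetical list of determiners to strip from the candidate phrase.
DETERMINERS = {"a", "an", "the", "some", "many"}


def subsumer_candidate(snippet, named_entity):
    """Extract the phrase before 'such as <NE>' as a subsumer candidate."""
    pattern = re.compile(
        r"((?:[A-Za-z]+\s+){0,2}[A-Za-z]+)\s+such as\s+"
        + re.escape(named_entity),
        re.IGNORECASE,
    )
    match = pattern.search(snippet)
    if match is None:
        return None  # the snippet does not contain the pattern
    words = [w for w in match.group(1).lower().split()
             if w not in DETERMINERS]
    return " ".join(words) or None


print(subsumer_candidate("a big city such as London", "London"))
print(subsumer_candidate("travel topics such as London sightseeing", "London"))
```

The second snippet shows why such candidates still need filtering: "travel topics" matches the pattern syntactically but is not a useful subsumer of London, which is where the later annotation and scoring steps come in.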

STEP by STEP (Semantic annotation) I
Semantic annotation consists of annotating Named Entities with ontological classes.
Approaches:
- Web-based statistical evaluation => requires a considerable number of queries
- Semantically unstructured annotations covering heterogeneous domains, which are hard to exploit (Barcelona is a metropolis, Madrid is a city)
- Direct matching between subsumer concept and ontology class => if the subsumer concept does not appear in the ontology, it is not annotated, even when its meaning is similar to that of one of the classes
- Direct matching + Web statistics + WordNet

STEP by STEP (Semantic annotation) II
Direct matching
- A stemming algorithm is applied to subsumer concepts and ontology classes in order to discover morphologically equivalent terms, e.g. city and cities
- Subsumer concepts are looked up in the ontology. If there is no result and the concept can be reduced, it is reduced and the process is repeated, e.g. big city -> city
- Choose the best proposed annotation
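The direct-matching loop can be illustrated with the sketch below. The `naive_stem` function is a toy plural stripper standing in for a real stemming algorithm, the ontology classes are invented for the example, and reducing a concept by dropping its leftmost modifier is an assumption based on the "big city -> city" example.

```python
def naive_stem(word):
    """Toy stemmer (assumption): only undoes simple English plurals."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def normalise(term):
    """Lower-case a term and stem its head (last) word."""
    words = term.lower().split()
    if words:
        words[-1] = naive_stem(words[-1])
    return " ".join(words)


def direct_match(concept, ontology_classes):
    """Look the concept up; on failure drop the leftmost modifier and retry."""
    index = {normalise(cls): cls for cls in ontology_classes}
    words = concept.split()
    while words:
        key = normalise(" ".join(words))
        if key in index:
            return index[key]
        words = words[1:]  # reduce the concept, e.g. "big city" -> "city"
    return None


ontology = ["City", "Cathedral", "Religious building"]  # hypothetical classes
print(direct_match("big cities", ontology))
print(direct_match("gothic cathedrals", ontology))
```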

STEP by STEP (Semantic annotation) III
Semantic matching
- For each candidate, get synonyms, hyponyms and hypernyms using WordNet (if there is more than one synset, the context is used to resolve the semantic ambiguity)
  Candidate: church => {abbey, basilica, cathedral, kirk, place of worship, house of prayer}
- Perform direct matching using the newly extracted subsumer candidates
- Choose the best annotation


STEP by STEP (Semantic annotation) IV
Semantic disambiguation
- In most cases a word can have several meanings (polysemy), e.g. {head -> part of the body, head -> geographic feature, etc.}
- When WordNet is used to get the synonyms of a concept, it is necessary to know which WordNet synset is the most appropriate

STEP by STEP (Semantic annotation) V
To resolve this problem, the context from which the candidate was extracted is used, e.g.:
- Document:
- Named Entity: Darren Aronofsky
- Subsumer candidate: producer
- Context: Filmmaker Darren Aronofsky commented, "I walked out of The Matrix [...] and I was thinking, 'What kind of science fiction movie can people make now?
- WordNet synsets:
  - S: (n) manufacturer, producer (someone who manufactures something)
  - S: (n) producer (someone who finds financing for and supervises the making and presentation of a show (play or film or program or similar work))
  - S: (n) producer (something that produces) "Maine is a leading producer of potatoes"; "this microorganism is a producer of disease"

STEP by STEP (Semantic annotation) VI
- The context is then compared with each synset using the cosine similarity measure
- The context alone is often not enough to decide which synset is best, so it is extended with Web snippets
Query: The Matrix Aron Aronofsky
Snippets:
[0]: Weeks before shooting his second movie, Requiem for a Dream, Darren Aronofsky took the film's star, Jared Leto, to see The Matrix at a Brooklyn mall.
[1]: UPCOMING FILM PROJECTS! There's been rumors for a while now that "The Matrix" trilogy filmmakers Andy and Lana Wachowski have been developing a secret
[N]
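A minimal bag-of-words sketch of this comparison follows. It assumes plain term-frequency vectors, a small hand-picked stop-word list and only the definition part of each gloss; the weighting actually used in the project is not stated in the slides, so the scores it yields are not expected to reproduce the reported values.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "at", "can", "for", "his", "i", "of", "or",
              "s", "that", "the", "to", "was", "what", "who"}


def bag_of_words(text):
    """Lower-cased term-frequency vector without stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score_synsets(context, glosses):
    ctx = bag_of_words(context)
    return [cosine(ctx, bag_of_words(g)) for g in glosses]


GLOSSES = [
    "someone who manufactures something",
    "someone who finds financing for and supervises the making and "
    "presentation of a show (play or film or program or similar work)",
    "something that produces",
]
CONTEXT = ('Filmmaker Darren Aronofsky commented, "I walked out of The '
           'Matrix [...] and I was thinking, What kind of science fiction '
           'movie can people make now?"')
SNIPPET = ("Weeks before shooting his second movie, Requiem for a Dream, "
           "Darren Aronofsky took the film's star, Jared Leto, to see "
           "The Matrix at a Brooklyn mall.")

context_only = score_synsets(CONTEXT, GLOSSES)
extended = score_synsets(CONTEXT + " " + SNIPPET, GLOSSES)
best = max(range(len(GLOSSES)), key=extended.__getitem__)
print(context_only, extended, GLOSSES[best])
```

With these inputs the original context shares no content word with any gloss, so all three scores are zero; only after the snippet is added does the film-related sense stand out, which mirrors the motivation for extending the context.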

STEP by STEP (Semantic annotation) VII
- Finally, the synset with the highest score is selected (0.084):
  S: (n) producer (someone who finds financing for and supervises the making and presentation of a show (play or film or program or similar work))
- Its synonyms, hypernyms and hyponyms are then extracted:
  Producer => {film maker, filmmaker, film producer, movie maker, theatrical producer}

STEP by STEP (Semantic annotation) VII
Choose the best annotation
- Compare the proposed annotations in pairs
- In a father-child relationship, the child is chosen
- Other relationships are resolved using Web statistics (PMI)

STEP by STEP (Semantic annotation) VIII

| | monument | cathedral | church | religious building |
| monument | X | PMI -> monument | PMI -> monument | PMI -> monument |
| cathedral | PMI -> monument | X | PMI -> cathedral | Superclass -> cathedral |
| church | PMI -> monument | PMI -> cathedral | X | Superclass -> church |
| religious building | PMI -> monument | Superclass -> cathedral | Superclass -> church | X |

| Subsumer concept | Hits(Sagrada Familia) | Hits(Cand) | Hits(SgF^Cand) | PMI |
| monument | … | … | … | ….50E-08 |
| cathedral | … | … | … | ….43E-08 |
| church | … | … | … | ….57E-08 |
| religious building | … | … | … | ….17E-12 |
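The pairwise selection rule can be sketched as follows. The parent links are read off the "Superclass" cells of the comparison matrix; the scores are hypothetical placeholders chosen only to respect the outcomes shown there (monument beats cathedral, cathedral beats church), because the actual PMI values are incomplete. A sequential elimination is used here as a simplification of the full pairwise matrix.

```python
# Parent links taken from the "Superclass" entries of the matrix.
PARENT = {"cathedral": "religious building", "church": "religious building"}

# Placeholder scores (NOT the PMI values from the slides).
TOY_PMI = {"monument": 4.0, "cathedral": 3.0,
           "church": 2.0, "religious building": 1.0}


def is_ancestor(ancestor, node, parent):
    """True if `ancestor` is reachable from `node` via parent links."""
    while node in parent:
        node = parent[node]
        if node == ancestor:
            return True
    return False


def prefer(a, b, parent, pmi):
    """Child wins over its ancestor; otherwise the higher score wins."""
    if is_ancestor(a, b, parent):
        return b
    if is_ancestor(b, a, parent):
        return a
    return a if pmi[a] >= pmi[b] else b


def best_annotation(candidates, parent, pmi):
    """Sequentially eliminate candidates through pairwise comparison."""
    best = candidates[0]
    for candidate in candidates[1:]:
        best = prefer(best, candidate, parent, pmi)
    return best


candidates = ["monument", "cathedral", "church", "religious building"]
print(best_annotation(candidates, PARENT, TOY_PMI))
```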
