Semantic Web: Extracting and Mining Structured Data from Unstructured Content


1 Semantic Web: Extracting and Mining Structured Data from Unstructured Content
Web Science Lecture
Besnik Fetahu, L3S Research Center, Leibniz Universität Hannover
May 20, 2014

2 1 Introduction


4 Introduction
- Large amounts of data.
- Heterogeneity of information: provenance, quality, content, representation, language, etc.
- Unstructured vs. structured data.
- Entities, topics, relations.
- Use cases: translation, semantic search, etc.


6 The vision
The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term "Semantic Web" refers to W3C's vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

7 Main Components
- Format: Turtle, N3, etc.
- Syntax: XML Schema
- Models: RDF
- Taxonomies: RDFS
- Ontologies: OWL
- Query languages: SPARQL
- Interchange formats: RIF

9 Data Formats and Models
- XML data format
- RDF data representation: (subject, predicate, object) triples
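To make the (subject, predicate, object) model concrete, here is a minimal Python sketch that stores triples as plain tuples. The `ex:` identifiers and the helper `objects_of` are hypothetical and chosen only for illustration; a real application would more likely rely on an RDF library.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example data; "ex:" is an assumed namespace prefix.
triples = [
    Triple("ex:Hannover", "rdf:type", "ex:City"),
    Triple("ex:Hannover", "ex:locatedIn", "ex:Germany"),
    Triple("ex:L3S", "ex:locatedIn", "ex:Hannover"),
]


def objects_of(subject: str, predicate: str, data: List[Triple]) -> List[str]:
    """Return every object o such that (subject, predicate, o) is in data."""
    return [t.obj for t in data if t.subject == subject and t.predicate == predicate]


print(objects_of("ex:Hannover", "ex:locatedIn", triples))
```

Each statement is independent of the others, which is what lets triples from different sources be merged into a single graph.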

10 Ontologies define the following concepts: entities, domains, rules, axioms

11 Representation: Differences in OWL ontologies
- OWL-Lite (a subset of OWL-DL): supports those users primarily needing a classification hierarchy and simple constraints. It supports cardinality constraints, but only permits cardinality values of 0 or 1.
- OWL-DL (a subset of OWL Full): supports maximum expressiveness while retaining computational completeness and decidability. OWL-DL includes all OWL language constructs, but they can be used only under certain restrictions.
- OWL Full: is meant for users who want maximum expressiveness and the syntactic freedom of RDF with no computational guarantees.

12 Representation and Schemas
- RDF Schema (RDFS)
  1. classes: rdfs:Class
  2. properties: rdf:Property, rdfs:subClassOf
  3. domains: rdfs:domain
- Web Ontology Language (OWL): OWL-Lite, OWL-DL
  1. classes: owl:Class
  2. properties: owl:equivalentClass, owl:sameAs
- Friend of a Friend (FOAF) ontology
  1. classes: foaf:Agent, foaf:Document, foaf:Organization, foaf:Person
- Simple Knowledge Organization System (SKOS) ontology
  1. classes: skos:Concept, skos:Collection
  2. properties: skos:related, skos:broader, skos:narrower

15 Representation
- RDFS example
- Hierarchical class modelling
- OWL ontology example
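The figures for these examples did not survive, so here is a small Python sketch of what hierarchical class modelling with rdfs:subClassOf allows: deriving all indirect superclasses of a class. The `ex:` classes are hypothetical; only `foaf:Person rdfs:subClassOf foaf:Agent` is taken from the FOAF vocabulary. This is an illustration of the idea, not a full RDFS reasoner.

```python
from typing import Dict, List, Set

# Direct rdfs:subClassOf statements, keyed by subclass.
# The "ex:" classes are assumptions made up for this sketch.
subclass_of: Dict[str, List[str]] = {
    "ex:Professor": ["ex:Researcher"],
    "ex:Researcher": ["foaf:Person"],
    "foaf:Person": ["foaf:Agent"],
}


def superclasses(cls: str, hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Return all direct and indirect superclasses of cls (transitive closure)."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in hierarchy.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(superclasses("ex:Professor", subclass_of)))
```

An instance typed as `ex:Professor` can therefore also be treated as a `foaf:Agent`, even though no triple states this explicitly.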

16 Representation: Ontologies vs. Taxonomies

17 Representation: ABox vs. TBox

18 Linked Data
- RDF data published as triples (subject, predicate, object)
- SPARQL: standard query language over RDF data
- Linked Data principles:
  1. URIs as names for things
  2. Dereferenceable URIs
  3. Provide information about things using standards: RDF, SPARQL
  4. Interlink with other things
- Billions of triples
- Interlink all data into one gigantic graph: lod-cloud, schema.org, ...
- Microformats: RDFa for annotating web pages
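Querying triples with SPARQL essentially means matching triple patterns that contain variables. The following Python sketch mimics a SPARQL basic graph pattern over an in-memory list of tuples. It is a toy evaluator under stated assumptions (the `ex:` data and the `?name` variable convention are made up here), not a SPARQL implementation.

```python
from typing import Dict, List, Optional, Tuple

TripleT = Tuple[str, str, str]
Bindings = Dict[str, str]

# Hypothetical data for illustration only.
data: List[TripleT] = [
    ("ex:alice", "ex:worksAt", "ex:L3S"),
    ("ex:L3S", "ex:locatedIn", "ex:Hannover"),
    ("ex:bob", "ex:worksAt", "ex:AcmeLabs"),
]


def match_pattern(pattern: TripleT, triple: TripleT, bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that pattern matches triple, or return None."""
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(patterns: List[TripleT], triples: List[TripleT]) -> List[Bindings]:
    """Return all variable bindings that satisfy every pattern."""
    solutions: List[Bindings] = [{}]
    for pattern in patterns:
        next_solutions: List[Bindings] = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Roughly: SELECT ?p ?org WHERE { ?p ex:worksAt ?org . ?org ex:locatedIn ex:Hannover }
print(query([("?p", "ex:worksAt", "?org"), ("?org", "ex:locatedIn", "ex:Hannover")], data))
```

The shared variable `?org` joins the two patterns, which is how a query follows links across the graph.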

19 Everything done?
- Only a small fraction of data is actually structured.
- Defining schemas, taxonomies, and ontologies manually and explicitly is cumbersome.
- A large proportion of data is unstructured or semi-structured.
- Can we automatically extract and model such content?


25
1. Semi-structured data: Wikipedia, WordNet
2. Social streams: Twitter
3. News corpora: NYT Collection, Reuters, Wall Street Journal (WSJ)
4. Web pages: Common Crawl, ClueWeb
5. Linked Data: lod-cloud


27
- Very large corpora of unstructured text.
- Heterogeneity: languages, quality, domains.
- Unstructured text has a rich underlying structure.
- Natural Language Processing (NLP): part-of-speech (POS) tagging, named entity recognition (NER), coreference resolution (Co-Ref), dependency parsing (DP), etc.
- Utilise NLP output for information extraction (IE) based on syntactic, semantic, and lexical patterns.
- Query-based and entity-based summarisation.

28
- Autonomous understanding of text by machines.
- Construct a belief based on the underlying corpus.
- OpenIE: a domain-independent IE paradigm for extracting relations, classes, and entities.
- TextRunner (Etzioni et al. 2008): a self-supervised approach for OpenIE.
- Represent each relation as a triple (subject, predicate, object).
- Understanding of the semantics of extracted triples is still primitive.
Etzioni O., Banko M., Cafarella M. J., AAAI 2007

31 TextRunner
1. Self-Supervised Learner
2. Single-Pass Extractor
3. Redundancy-Based Assessor
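The intuition behind the third component is that a triple extracted from many different sentences is more likely to be correct. The Python sketch below substitutes a plain support count with a threshold for that step; it is a deliberate simplification and not TextRunner's actual scoring. The function names, the `min_support` parameter, and the sample extractions are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

TripleT = Tuple[str, str, str]


def normalize(triple: TripleT) -> TripleT:
    """Lower-case and trim each part so trivial variants count as one triple."""
    s, p, o = triple
    return (s.strip().lower(), p.strip().lower(), o.strip().lower())


def assess(extractions: Iterable[Tuple[int, TripleT]], min_support: int = 2) -> Dict[TripleT, int]:
    """Keep triples that were extracted from at least min_support distinct sentences."""
    support: Dict[TripleT, Set[int]] = defaultdict(set)
    for sentence_id, triple in extractions:
        support[normalize(triple)].add(sentence_id)
    return {t: len(ids) for t, ids in support.items() if len(ids) >= min_support}


# Hypothetical output of a single-pass extractor: (sentence id, triple).
extractions = [
    (1, ("Michael Webb", "appeared on", "Oprah")),
    (2, ("michael webb", "appeared on", "oprah")),
    (3, ("Oprah", "appeared on", "Michael Webb")),
]
print(assess(extractions))
```

Counting distinct sentence ids rather than raw extractions avoids rewarding a triple that the extractor emitted twice from the same sentence.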

35
- DP of chunks of text for relation extraction
- Syntactic patterns for relation extraction, e.g. "Michael Webb appeared on Oprah..." yields (Michael Webb; appear on; Oprah) (Schmitz et al. 2007)
- Semantic and lexical patterns for relation extraction (Schmitz et al. 2007)
- ReVerb: two-step approach, relation first rather than arguments first (Fader et al. 2011)
  1. identify relations
  2. identify arguments
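The relation-first idea can be sketched in a few lines of Python. The pattern below loosely follows the POS-based constraint reported for ReVerb (a verb, optionally followed by other words and ending in a preposition or particle), but the coarse tag set, the hand-tagged input, and the nearest-noun-phrase argument heuristic are simplifying assumptions, and no lemmatisation is done. It is not the ReVerb implementation.

```python
import re
from typing import List, Tuple

# Coarse tag classes: V = verb, W = noun/adj/adv/pron/det, P = preposition/particle,
# N = proper noun (kept out of the relation pattern in this sketch), O = anything else.
TAG_CLASS = {
    "VERB": "V",
    "NOUN": "W", "ADJ": "W", "ADV": "W", "PRON": "W", "DET": "W",
    "ADP": "P", "PRT": "P",
    "PROPN": "N",
}
RELATION = re.compile(r"V(W*P)?")          # covers V, V P, and V W* P
NP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}  # tags allowed inside an argument


def extract_relations(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Step 1: find relation phrases. Step 2: attach the adjacent noun phrases."""
    tokens = [tok for tok, _ in tagged]
    tags = [tag for _, tag in tagged]
    classes = "".join(TAG_CLASS.get(tag, "O") for tag in tags)  # one char per token
    triples = []
    for match in RELATION.finditer(classes):
        start, end = match.span()
        left = start
        while left > 0 and tags[left - 1] in NP_TAGS:
            left -= 1
        right = end
        while right < len(tags) and tags[right] in NP_TAGS:
            right += 1
        if left < start and right > end:  # both arguments must be non-empty
            triples.append((
                " ".join(tokens[left:start]),
                " ".join(tokens[start:end]),
                " ".join(tokens[end:right]),
            ))
    return triples


sentence = [("Michael", "PROPN"), ("Webb", "PROPN"), ("appeared", "VERB"),
            ("on", "ADP"), ("Oprah", "PROPN")]
print(extract_relations(sentence))
```

Fixing the relation phrase first keeps the extractor from producing incoherent relations that an arguments-first approach might stitch together from distant words.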

37
- ClausIE (del Corro et al. 2013): a clause-based approach for relation extraction.
- Automated approach, less restrictive and with improved recall.

41
- Textual content has a rich underlying syntactic and semantic structure.
- Frequently extracted syntactic and semantic information: POS, Co-Ref, and NER.
- Stanford CoreNLP: named entity recognition with specific entity types Person, Organisation, Place, Date.
- NED: named entity disambiguation of surface forms with entities from knowledge bases
  1. DBpedia Spotlight
  2. Wikiminer
  3. AIDA
  ...
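As a rough illustration of named entity disambiguation, the Python sketch below picks, for a surface form, the candidate entity whose descriptive keywords overlap most with the surrounding text. The candidate dictionary and its keywords are invented for this example, and the overlap score is a stand-in; the tools listed above use considerably richer models.

```python
from typing import Dict, Optional, Set

# Hypothetical candidate index: surface form -> {entity: descriptive keywords}.
candidates: Dict[str, Dict[str, Set[str]]] = {
    "Paris": {
        "dbr:Paris": {"france", "capital", "city", "seine"},
        "dbr:Paris_Hilton": {"actress", "hotel", "celebrity", "television"},
    },
}


def disambiguate(surface: str, context: str,
                 index: Dict[str, Dict[str, Set[str]]]) -> Optional[str]:
    """Return the candidate entity sharing the most words with the context."""
    context_words = set(context.lower().split())
    best_entity: Optional[str] = None
    best_score = -1
    for entity, keywords in index.get(surface, {}).items():
        score = len(context_words & keywords)
        if score > best_score:  # ties keep the first candidate listed
            best_entity, best_score = entity, score
    return best_entity


print(disambiguate("Paris", "Paris is the capital of France", candidates))
```

The function returns `None` for surface forms that have no candidates, which corresponds to mentions the knowledge base does not cover.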


46 Prominent knowledge base examples:
1. WordNet knowledge base
2. Wikipedia encyclopaedia
3. DBpedia knowledge base
4. YAGO knowledge base

47 Interlinking
- Semantic relatedness of entities
- Exploit existing knowledge base structures
- Latent relationships via semantic relations
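One simple way to exploit knowledge base structure for relatedness is to compare the neighbourhoods of two entities. The Python sketch below uses Jaccard overlap of directly connected nodes; this measure and the `ex:` data are assumptions chosen for illustration, not the method used in the lecture.

```python
from typing import List, Set, Tuple

TripleT = Tuple[str, str, str]

# Hypothetical knowledge base fragment.
kb: List[TripleT] = [
    ("ex:Hannover", "ex:locatedIn", "ex:Germany"),
    ("ex:Berlin", "ex:locatedIn", "ex:Germany"),
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Hannover", "rdf:type", "ex:City"),
    ("ex:Paris", "rdf:type", "ex:City"),
    ("ex:Paris", "ex:locatedIn", "ex:France"),
]


def neighbours(entity: str, triples: List[TripleT]) -> Set[str]:
    """Nodes connected to entity by any predicate, in either direction."""
    found: Set[str] = set()
    for s, _, o in triples:
        if s == entity:
            found.add(o)
        elif o == entity:
            found.add(s)
    return found


def relatedness(a: str, b: str, triples: List[TripleT]) -> float:
    """Jaccard overlap of the two neighbourhoods, ignoring a and b themselves."""
    na = neighbours(a, triples) - {b}
    nb = neighbours(b, triples) - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0


print(relatedness("ex:Hannover", "ex:Berlin", kb))
print(relatedness("ex:Hannover", "ex:Paris", kb))
```

Entities that never co-occur in a triple can still score highly if they share neighbours, which is one way latent relationships surface.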

48
- Search through structured data in the form of triples.
- Weigh different predicates differently.
- Map user keyword queries to matching entities.
Blanco et al. 2011
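The idea of weighting predicates can be sketched as a keyword search in which a hit in an important predicate (for example a label) counts more than a hit in a less important one. The Python code below is a minimal sketch with made-up weights and data; it is not the ranking function of the cited work.

```python
from typing import Dict, List, Tuple

# Hypothetical entity descriptions: URI -> list of (predicate, literal value).
entities: Dict[str, List[Tuple[str, str]]] = {
    "ex:Hannover": [("rdfs:label", "Hannover"),
                    ("rdfs:comment", "City in Lower Saxony")],
    "ex:Hannover96": [("rdfs:label", "Hannover 96"),
                      ("rdfs:comment", "Football club from Hannover city")],
    "ex:Berlin": [("rdfs:label", "Berlin"),
                  ("rdfs:comment", "Capital city")],
}

# Assumed weights: a match in the label counts three times as much as one in a comment.
weights: Dict[str, float] = {"rdfs:label": 3.0, "rdfs:comment": 1.0}


def search(query: str, index: Dict[str, List[Tuple[str, str]]],
           predicate_weights: Dict[str, float], default_weight: float = 1.0) -> List[str]:
    """Rank entities by the weighted number of query-term hits in their literals."""
    terms = query.lower().split()
    scores: Dict[str, float] = {}
    for uri, properties in index.items():
        score = 0.0
        for predicate, value in properties:
            words = value.lower().split()
            hits = sum(words.count(term) for term in terms)
            score += predicate_weights.get(predicate, default_weight) * hits
        if score > 0:
            scores[uri] = score
    # Highest score first; ties are broken alphabetically by URI.
    return sorted(scores, key=lambda uri: (-scores[uri], uri))


print(search("hannover city", entities, weights))
```

Changing the weights changes the ranking without touching the data, which is the practical appeal of treating predicates as separately weighted fields.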

49 Zaveri et al. 2012


51
- Large volumes of unstructured, high-quality data
- High applicability of IE techniques for structuring unstructured data
- Availability of encyclopaedias in the form of knowledge bases
- Wide range of applications
- Further expansion of knowledge bases with facts about the real world from unstructured text, apart from Wikipedia infoboxes
- Aspects of data


53
- Suchanek F., Kasneci Gj., Weikum G. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In Proceedings of the 16th WWW.
- Wagner C., Singer P., Strohmaier M., Huberman B. Semantic Stability in Social Tagging Streams. CoRR.
- Kontokostas D., Westphal P., Auer S., Hellmann S., Lehmann J., Cornelissen R., Zaveri A. Test-driven Evaluation of Linked Data Quality. In Proceedings of the 23rd WWW.
- Herzig D., Mika P., Blanco R., Tran T. Federated Entity Search Using On-the-Fly Consolidation. In Proceedings of the ISWC.
- Palmero Aprosio A., Giuliano C., Lavelli A. Automatic Expansion of DBpedia Exploiting Wikipedia Cross-Language Information. In Proceedings of the 11th ESWC.

54
- Fabian Suchanek and Gerhard Weikum. Knowledge harvesting in the big-data era. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD 13).
- Gerhard Weikum and Martin Theobald. From information to knowledge: harvesting entities and relationships from web sources. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS 10).
- Roi Blanco, Peter Mika, and Sebastiano Vigna. Effective and efficient entity search in RDF data. In Proceedings of the 10th international conference on The semantic web (ISWC 11).
- Jeffrey Pound, Peter Mika, and Hugo Zaragoza. Ad-hoc object retrieval in the web of data. In Proceedings of the 19th international conference on World wide web (WWW 10).
- Nunes, B. P., Dietze, S., Casanova, M. A., Kawase, R., Fetahu, B. and Nejdl, W. Combining a co-occurrence-based and a semantic measure for entity linking. In Proceedings of the 10th Extended Semantic Web Conference, 2013 (ESWC 13).
- Zaveri, Amrapali, Rula, Anisa, Maurino, Andrea, Pietrobon, Ricardo, Lehmann, Jens and Auer, Sören. Quality Assessment Methodologies for Linked Open Data. Semantic Web Journal (2014).

55
- Gangemi, Aldo. A Comparison of Knowledge Extraction Tools for the Semantic Web. In Proceedings of the 10th Extended Semantic Web Conference, 2013 (ESWC 13).
- Mendes, Pablo N., Jakob, Max, García-Silva, Andrés and Bizer, Christian. DBpedia Spotlight: shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems.
- Yosef, Mohamed Amir, Hoffart, Johannes, Bordino, Ilaria, Spaniol, Marc and Weikum, Gerhard. AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4, no. 12 (2011).
- Isabelle Augenstein, Sebastian Padó, and Sebastian Rudolph. LODifier: generating linked data from unstructured text. In Proceedings of the 9th international conference on The Semantic Web: research and applications (ESWC 12).
- Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: a nucleus for a web of open data. In Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference (ISWC 07/ASWC 07).
- Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (WWW 07).
- Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open information extraction from the web. Commun. ACM 51, 12 (December 2008).

56
- Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 12).
- Raymond J. Mooney and Razvan Bunescu. Mining knowledge from text using information extraction. SIGKDD Explor. Newsl.
- Chang Wang, James Fan, Aditya Kalyanpur, and David Gondek. Relation extraction with relation topics. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 11).
- Robert Isele, Anja Jentzsch, and Christian Bizer. Silk Server: Adding missing Links while consuming Linked Data. COLD.
- Oren Etzioni. Machine reading at web scale. In Proceedings of the 2008 International Conference on Web Search and Data Mining.
- Luciano Del Corro and Rainer Gemulla. ClausIE: clause-based open information extraction. In Proceedings of the 22nd international conference on World Wide Web (WWW 13).
- Rudi Studer, V. Richard Benjamins, and Dieter Fensel. Knowledge engineering: Principles and methods. Data & Knowledge Engineering, Volume 25, Issues 1-2, 1998.
- Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data: The Story So Far. International Journal on Semantic Web and Information Systems 5(3):1-22 (2009).

57 Thank you! Questions?
