Event Registry. Gregor Leban. Jozef Stefan Institute

Similar documents
3XL News: a Cross-lingual News Aggregator and Reader

Linked Data in Archives

Rich Interfaces for Reading News on the Web

NEW DEVELOPMENTS ON THE NEWISTIC PLATFORM. Horatiu Mocian and Ovidiu Dan. Short paper

Entity Linking at TAC Task Description

Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population

DBpedia-An Advancement Towards Content Extraction From Wikipedia

Visualization and text mining of patent and non-patent data

Annotating Spatio-Temporal Information in Documents

Automatically Annotating Text with Linked Open Data

Lecture 24: NER & Entity Linking

August 2012 Daejeon, South Korea

Enabling Semantic Search in Large Open Source Communities

Semantic Web Company. PoolParty - Server. PoolParty - Technical White Paper.

Columbia University High-Level Feature Detection: Parts-based Concept Detectors

Constructing a Focused Taxonomy from a Document Collection

FINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION

Question Answering Systems

Chapter 2. Architecture of a Search Engine

RDA: a new cataloging standard for a digital future

A cocktail approach to the VideoCLEF 09 linking task

TIB AV-Portal. Margret Plank 19th of January 2015 TACC Meeting

Part I: Data Mining Foundations

3 Publishing Technique

A Hybrid Neural Model for Type Classification of Entity Mentions

How Co-Occurrence can Complement Semantics?

TEXT MINING: THE NEXT DATA FRONTIER

Semantic Annotation, Search and Analysis

Semantic Web. Tahani Aljehani

Things to consider when using Semantics in your Information Management strategy. Toby Conrad Smartlogic

Natural Language Processing with PoolParty

ACCELERATE YOUR SHAREPOINT ADOPTION AND ROI WITH CONTENT INTELLIGENCE

TRENTINOMEDIA: Exploiting NLP and Background Knowledge to Browse a Large Multimedia News Store

Let me send relevant pictures to my friends while we chat.

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

A simple method to extract abbreviations within a document using regular expressions

A Multilingual Social Media Linguistic Corpus

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

A Framework for BioCuration (part II)

Enhanced retrieval using semantic technologies:

Projects Tools BLAH proposal Conclusion. OntoGene/BioMeXT

Linking Entities in Chinese Queries to Knowledge Graph

Enterprise Knowledge Map: Toward Subject Centric Computing. March 21st, 2007 Dmitry Bogachev

Query Difficulty Prediction for Contextual Image Retrieval

Object Detection Lecture Introduction to deep learning (CNN) Idar Dyrdal

DIGITAL GLOBAL AGENCY Pay Per Click (PPC)/Paid Search- A Case Study

TWITTER USE IN ELECTION CAMPAIGNS: TECHNICAL APPENDIX. Jungherr, Andreas. (2016). Twitter Use in Election Campaigns: A Systematic Literature

A Search Engine based on Query Logs, and Search Log Analysis at the University of Sunderland

The CEN Metalex Naming Convention

Final Project Discussion. Adam Meyers Montclair State University

Semantic Annotation of Web Resources Using IdentityRank and Wikipedia

The Wikipedia XML Corpus

CS5670: Computer Vision

Corpus Acquisition from the Interwebs. Christian Buck, University of Edinburgh

Papers for comprehensive viva-voce

For convenience in typing examples, we can shorten the wordnet name to wn.

iknow Use Cases Michael Br ands Senior Product Manager

FROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà

Demo: Linked Open Statistical Data for the Scottish Government

Mapping the library future: Subject navigation for today's and tomorrow's library catalogs

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

A fully-automatic approach to answer geographic queries: GIRSA-WP at GikiP

Historical Text Mining:

Unstructured Data. CS102 Winter 2019

Parmenides. Semi-automatic. Ontology. construction and maintenance. Ontology. Document convertor/basic processing. Linguistic. Background knowledge

Template text for search methods: New reviews

3 Background technologies 3.1 OntoGen The two main characteristics of the OntoGen system [1,2,6] are the following.

SVM cont d. Applications face detection [IEEE INTELLIGENT SYSTEMS]

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.0 July 24, 2015

MASWS Natural Language and the Semantic Web

Data and Information Integration: Information Extraction

It s time for a semantic engine!

Information Retrieval, Information Extraction, and Text Mining Applications for Biology. Slides by Suleyman Cetintas & Luo Si

Enhancing applications with Cognitive APIs IBM Corporation

ELITE TRIATHLON UNIFORMS UPDATED ON

The Emerging Web of Linked Data

Agricultural bibliographic data sharing & interoperability in China

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities

Introduction to Text Mining. Hongning Wang

ELITE TRIATHLON UNIFORMS UPDATED ON

LODtagger. Guide for Users and Developers. Bahar Sateli René Witte. Release 1.1 October 7, 2016

DIGITAL GLOBAL AGENCY Search Engine Optimization- A Case Study. Edited by Digital Global Agency in house team March 2017

A Document Graph Based Query Focused Multi- Document Summarizer

Graph based semantic annotation for enriching documents with linked data

Data Mining Concepts. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech

Supporting Rapid Processing and Interactive Map-Based Exploration of Streaming News

YAM++ : A multi-strategy based approach for Ontology matching task

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017

Query Disambiguation from Web Search Logs

WEB OF SCIENCE. Quick Reference Card SCIENTIFIC ISI WEB OF KNOWLEDGE SM. 1 Search. Cited Reference Search

Named Entity Detection and Entity Linking in the Context of Semantic Web

Semantic Retrieval of the TIB AV-Portal. Dr. Sven Strobel IATUL 2015 July 9, 2015; Hannover

International Journal of Video& Image Processing and Network Security IJVIPNS-IJENS Vol:10 No:02 7

N E T W O R K L I C E N S E

OvidSP Frequently Asked Questions

ELITE MULTISPORT UNIFORMS UPDATED ON

Read Naturally SE Update Windows Network Installation Instructions

MIRACLE at ImageCLEFmed 2008: Evaluating Strategies for Automatic Topic Expansion

Other Templates. Overview. URL Shortener & Redirect Page

We offer background check and identity verification services to employers, businesses, and individuals. For example, we provide:

Transcription:

Event Registry Gregor Leban Jozef Stefan Institute

Event registry http://eventregistry.org/ Event registry is a system for collecting news content published globally automatic identification of world events from news articles Extraction of core event information Extensive search options Rich visualization options

What is an event? Event = any significant happening in the world Assumption: If something is significant reported by various news publishers Event is identified through a collection of news articles that report about the same thing

Event registry pipeline Data collectio n Main strea m news Article-level processing Event construction Semantic annotation Article clustering Extraction of date references Crosslingual cluster matching Detection of article location Crosslingual article matching Detection of article duplicates Event storage & maintenanc e Manual event administration Event formation Event info. extraction Filling event template Identifying related events Fronten d interfac e API access

Event registry pipeline Data collectio n Main strea m news Article-level processing Event construction Semantic annotation Article clustering Extraction of date references Crosslingual cluster matching Detection of article location Crosslingual article matching Detection of article duplicates Event storage & maintenanc e Manual event administration Event formation Event info. extraction Filling event template Identifying related events Fronten d interfac e API access

Input data We use a data collection service News Feed http://newsfeed.ijs.si/ Monitoring ~100k mainstream news RSS feeds Data volume: ~300k articles per day 15 languages used in Event Registry: eng, ger, spa, chi, slo, cat, por, ita, fra, rus, ara, tur, cro, ser

Event registry pipeline Data collectio n Main strea m news Article-level processing Event construction Semantic annotation Article clustering Extraction of date references Crosslingual cluster matching Detection of article location Crosslingual article matching Detection of article duplicates Event storage & maintenanc e Manual event administration Event formation Event info. extraction Filling event template Identifying related events Fronten d interfac e API access

Article semantic annotation We want to capture the meaning of the words mentioned in the text ample from: http://nlp.cs.rpi.edu/paper/wikificationtutorial.pdf

Article semantic annotation (2) Ambiguity same word, different meaning (e.g. Chicago) Variability different words, same meaning (Mac, Macintosh) Entity linking = Named entity recognition + disambiguation Knowledge base https://en.wikipedia.org/wiki/chicago_(typeface) https://en.wikipedia.org/wiki/macintosh https://en.wikipedia.org/wiki/chicago_(band)

Wikifier demo (http://wikifier.org/)

Computing similarity between articles in different languages Question: how similar is this Spanish article to this English article? Translation tool slow/expensive Alternative: Canonical correlation analysis Map each article into a common semantic space Use dot product to compute similarity between documents datos, minería, data, mining, Common utilizando, Train artificial, a mapping using an aligned corpus (formodelos, example knowledge, procesamiento semantic machine,discovery, inteligencia,..., Wikipedia) buzzword, term,..., space Documents as vectors 500 dimensions

CCA demo ( http://xling.ijs.si) Rupnik, et.al - http://www.jair.org/papers/paper4780.html

Event registry pipeline Data collectio n Main strea m news Article-level processing Event construction Semantic annotation Article clustering Extraction of date references Crosslingual cluster matching Detection of article location Crosslingual article matching Detection of article duplicates Event storage & maintenanc e Manual event administration Event formation Event info. extraction Filling event template Identifying related events Fronten d interfac e API access

Article clustering For each language we try to identify groups of articles that describe a single event Online clustering algorithm Cluster based on article title article content detected named entities Different articles about same event

Cross-lingual cluster linking Each cluster only contains articles in the same language Clusters in different languages can describe the Spanish English samecluster event how do we find them? cluster? =

Cross-lingual cluster linking (2) Use machine learning Extract a set of features for evaluated pair of clusters Similarity of relevant concepts (semantic annotations) in each cluster Time difference between articles in each cluster Similarity of article categories Agreement in story location CCA similarity between articles in both clusters Classify using SVM model trained on >1.000 examples 90% accuracy

Event formation Using one or more linked clusters we form an event Each event is assigned a unique id ttp://eventregistry.org/event/4284299

Extracting event information Use articles assigned to the event to extract event information Information to extract: Location Event date Main event entities and keywords Event categories Event title and snippet

DEMO ( http://eventregistry.org/)

API access Content in Event Registry can be accessed using the API To simplify access, a Python library is available at: https://github.com/gregorleban/eventregistry

API access example (1)

API access example (2) events : { resultcount : 122, results : [ {'articlecounts': {'eng': 54.0, 'total': 54.0}, 'categories': [{...}], 'concepts': [{...},...], 'eventdate': '2014-08-29', 'eventdateend': '', 'multilinginfo': { 'eng': { 'title':..., 'summary': }}, 'uri': '1211229', 'wgt': 9.0} }, ]}

API access example (3)

API access example (4)

Future / Ongoing work

Detection of news bias Various ways in which news publishers can be biased Examples: Gender bias News source citations Topic bias Geographical bias Vocabulary bias

Information diffusion/propagation Analysis of how information propagates from one publisher to the next

Infobox extraction / slot filling Events are of different types (e.g. earthquakes, mergers and acquisitions, product releases, ) For each event type we can define a specific set of properties (slots) that should be extracted

Causality Goal: Finding causal templates in event sequences Example 1: campaign events election day inauguration Example 2: flood outbreak of a disease How to do it: For each event we can identify related events Generalize events to find common patterns

Event registry Event registry provides extensive search options...

Event registry (2)... and visualizations of search results

Conference Location 201#-##-##

Conference Location 201#-##-##

Conference Location 201#-##-##

Conference Location 201#-##-##

Conference Location 201#-##-##