Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules

Similar documents
SUTIME: Evaluation in TempEval-3

Annotating Spatio-Temporal Information in Documents

Example. Section: PS 709 Examples of Calculations of Reduced Hours of Work Last Revised: February 2017 Last Reviewed: February 2017 Next Review:

TempWeb rd Temporal Web Analytics Workshop

Temporal Information Extraction using Regular Expressions

Final Project Discussion. Adam Meyers Montclair State University

Automatic Metadata Extraction for Archival Description and Access

Fast and Effective System for Name Entity Recognition on Big Data

User Guide on Online Resource Booking System (ORBS) Centre for Genomic Sciences HKU

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

KDD- Service based Numerical Entity Searcher (KSNES) Presentation 2 on March 31 st, Naga Sowjanya Karumuri. CIS 895 MSE PROJECT

WebAnno: a flexible, web-based annotation tool for CLARIN

Yahoo! Webscope Datasets Catalog January 2009, 19 Datasets Available

An UIMA based Tool Suite for Semantic Text Processing

HP Project and Portfolio Management Center

Assignment #1: Named Entity Recognition

PRIS at TAC2012 KBP Track

NLP Final Project Fall 2015, Due Friday, December 18

School of Computing and Information Systems The University of Melbourne COMP90042 WEB SEARCH AND TEXT ANALYSIS (Semester 1, 2017)

TEXT PREPROCESSING FOR TEXT MINING USING SIDE INFORMATION

Nortel Enterprise Reporting Quality Monitoring Meta-Model Guide

Topics in Parsing: Context and Markovization; Dependency Parsing. COMP-599 Oct 17, 2016

CS224n: Natural Language Processing with Deep Learning 1 Lecture Notes: Part IV Dependency Parsing 2 Winter 2019

RFC Editor Reporting September 2010

Universitat Politècnica de Catalunya Facultat d'informàtica de Barcelona Master of Innovation and Research in Informatics(MIRI)

Temponym Tagging: Temporal Scopes for Textual Phrases

Toward Part-based Document Image Decoding

A CASE STUDY: Structure learning for Part-of-Speech Tagging. Danilo Croce WMR 2011/2012

The CKY Parsing Algorithm and PCFGs. COMP-550 Oct 12, 2017

Associating video frames with text

LEED Interpretations and Addenda Database Help Document

Lecture 11: Regular Expressions. LING 1330/2330: Introduction to Computational Linguistics Na-Rae Han

Identifying and Understanding Dates and Times in

Networked Access to Library Resources

JSON-LD 1.0 Processing Algorithms and API

Tokenization - Definition

A General-Purpose Rule Extractor for SCFG-Based Machine Translation

44 Tricks with the 4mat Procedure

Interactions 2 Listening and Speaking Guide to CD Tracks

Genesys Info Mart. date-time Section

Excel: Tables, Pivot Tables & More

Online Algorithms. - Lecture 4 -

From Ontologies to Information Extraction and Back

Making Sense Out of the Web

IT Services Performance Report

Lecture11b: NLP (Introduction)

A cocktail approach to the VideoCLEF 09 linking task

Content Based Key-Word Recommender

This book is licensed under a Creative Commons Attribution 3.0 License

Module 3: GATE and Social Media. Part 4. Named entities

How To: Creating Relativity Batch Sets & Batches

UNIVERSITY OF EDINBURGH COLLEGE OF SCIENCE AND ENGINEERING SCHOOL OF INFORMATICS INFR08008 INFORMATICS 2A: PROCESSING FORMAL AND NATURAL LANGUAGES

DDOS-GUARD Q DDoS Attack Report

A tool for Cross-Language Pair Annotations: CLPA

afewadminnotes CSC324 Formal Language Theory Dealing with Ambiguity: Precedence Example Office Hours: (in BA 4237) Monday 3 4pm Wednesdays 1 2pm

Developing Focused Crawlers for Genre Specific Search Engines

Information Retrieval CSCI

Sitecore StatCenter Module. User Guide. A guide for end users and administrators.

ConceptDraw PROJECT Server

Needs Driven Workflow Design

Grammar. Dates and Times. Components

Natural Language Processing

Employment Ontario Information System (EOIS) Service Provider (SP) Connect

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Privacy and Security in Online Social Networks Department of Computer Science and Engineering Indian Institute of Technology, Madras

CS 124/LINGUIST 180 From Languages to Information

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

SEO Enhanced Packages Powered by Eworks WSI & WSI SociallMindz - Call:

Lecture 4: Syntax Specification

DOWNLOAD OR READ : WEEKLY PLANNER 2018 TWO PAGE WEEK AT A GLANCE PDF EBOOK EPUB MOBI

Question Answering Using XML-Tagged Documents

Process Modelling using Petri Nets

A Practical Approach to Programming With Assertions

The Year argument can be one to four digits between 1 and Month is a number representing the month of the year between 1 and 12.

Domain-specific Concept-based Information Retrieval System

Massively Increasing TIMEX3 Resources: A Transduction Approach

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

Vision Plan. For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0

Introduction CJ s Cross-Device Tracking Solution Key Strategies Shopping & Device Paths Average Order Value...

How to.. What is the point of it?

The New C Standard (Excerpted material)

The Data Divide. Luke Segars - Google CS10

New Features in MONITOR version 8.1

Location Tagging in Text

Regular Expressions. Agenda for Today. Grammar for a Tiny Language. Programming Language Specifications

Enterprise Architect. User Guide Series. Roadmap Diagrams. Author: Sparx Systems. Date: 30/06/2017. Version: 1.0 CREATED WITH

Reports & Updates. Chapter 10

Quick Start Guide. Version R94. English

Stack- propaga+on: Improved Representa+on Learning for Syntax

PROJECT Elmarknadshubb DATE DRAFT VERSION /1263. Elmarknadshubb. BRS-SE-611 Register meter value 1/12

Microsoft Project. EPIC 76/106 Manchester St. PO Box 362 Christchurch 8140

INF FALL NATURAL LANGUAGE PROCESSING. Jan Tore Lønning, Lecture 4, 10.9

The Goal of this Document. Where to Start?

Annotation Science From Theory to Practice and Use Introduction A bit of history

Minnesota No-Fault Arbitration A Guide to the AAA s New Online Scheduling System Effective June 1, 2018

Question Answering Systems

BASE STANDARD PREMIER. PRICE 290 p/m 723 p/m 1809 p/m SEO (SEARCH ENGINE OPTIMIZATION)

CSC401 Natural Language Computing

Form Identifying. Figure 1 A typical HTML form

Syntax Analysis. Chapter 4

Transcription:

Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules Xiaoshi Zhong, Aixin Sun, and Erik Cambria Computer Science and Engineering Nanyang Technological University {xszhong, axsun, cambria}@ntu.edu.sg

Outline Time expression analysis Datasets: TimeBank, Gigaword, WikiWars, Tweets Findings: short expressions, occurrence, small vocabulary, similar syntactic behavior Time expression recognition SynTime: syntactic token types and general heuristic rules Baselines: HeidelTime, SUTime, UWTime

Time Expression Analysis Datasets TimeBank Gigaword WikiWars Tweets Findings Short time expressions Occurrence Small vocabulary Similar syntactic behaviour Example time expressions: now today Friday February the last week 13 January 1951 June 30, 1990 8 to 20 days the third quarter of 1984

Time Expression Analysis - Datasets Datasets TimeBank: a benchmark dataset used in TempEval series Gigaword: a large dataset with generated labels and used in TempEval-3 WikiWars: a specific domain dataset collected from Wikipedia about war Tweets: a manually labeled dataset with informal text collected from Twitter Statistics of the datasets Dataset #Docs #Words #TIMEX TimeBank 183 61,418 1,243 Gigaword 2,452 666,309 12,739 WikiWars 22 119,468 2,671 Tweets 942 18,199 1,127 The four datasets vary in source, size, domain, and text type, but we will see that their time expressions demonstrate similar characteristics.

Time Expression Analysis Finding 1 Short time expressions: time expressions are very short. 80% of time expressions contain 3 words 90% of time expressions contain 4 words Time expressions follow a similar length distribution Average length of time expressions Dataset Average length TimeBank 2.00 Gigaword 1.70 WikiWars 2.38 Tweets 1.51 Average length: about 2 words

Time Expression Analysis Finding 2 Occurrence: most of time expressions contain time token(s). Percentage of time expressions that contain time token(s) Dataset Percentage TimeBank 94.61 Gigaword 96.44 WikiWars 91.81 Tweets 96.01 Example time tokens (red): now today Friday February the last week 13 January 1951 June 30, 1990 8 to 20 days the third quarter of 1984

Time Expression Analysis Finding 3 Small vocabulary: only a small group of time words are used to express time information. Number of distinct words and time tokens in time expressions Dataset #Words #Time tokens TimeBank 130 64 Gigaword 214 80 WikiWars 224 74 Tweets 107 64 Number of distinct words and time tokens across four datasets #Words #Time tokens 350 123 45 distinct time tokens appear in all the four datasets. That means, time expressions highly overlap at their time tokens. next year 2 years year 1 10 yrs ago Overlap at year

Time Expression Analysis Finding 4 Similar syntactic behaviour: (1) POS information cannot distinguish time expressions from common text, but (2) within time expressions, POS tags can help distinguish their constituents. (1) For the top 40 POS tags (10 4 datasets), 37 have percentage lower than 20%, other 3 are CD. (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and numerals have CD.

Time Expression Analysis Eureka! Similar syntactic behaviour: (1) POS information cannot distinguish time expressions from common text, but (2) within time expressions, POS tags can help distinguish their constituents. (1) For the top 40 POS tags (10 4 datasets), 37 have percentage lower than 20%, other 3 are CD. (2) Time tokens mainly have NN* and RB, modifiers have JJ and RB, and numerals have CD. When seeing (2), we realize that this is exactly how linguists define part-of-speech for language; similar words have similar syntactic behaviour. The definition of part-of-speech for language inspires us to define a type system for the time expression, part of language. Our Eureka! moment

Time Expression Analysis - Summary Summary On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. Idea for recognition To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral.

Time Expression Analysis - Idea Summary On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. Idea for recognition To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. 20 days; this week; next year; July 29;

Time Expression Analysis - Idea Summary On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. Idea for recognition To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. Time token 20 days; this week; next year; July 29;

Time Expression Analysis - Idea Summary On average, a time expression contains two tokens; one is time token and the other is modifier/numeral. And the time tokens are in small size. Idea for recognition To recognize a time expression, we first recognize the time token, then recognize the modifier/numeral. Time token Modifier/Numeral 20 days; this week; next year; July 29;

Time Expression Recognition SynTime Syntactic token types General heuristic rules Baseline methods HeidelTime SUTime UWTime Experiment datasets TimeBank WikiWars Tweets

Time Expression Recognition - SynTime Syntactic token types General heuristic rules

Time Expression Recognition - SynTime Syntactic token types A type system Time token: explicitly express time information, e.g., year 15 token types: DECADE, YEAR, SEASON, MONTH, WEEK, DATE, TIME, DAY_TIME, TIMELINE, HOLIDAY, PERIOD, DURATION, TIME_UNIT, TIME_ZONE, ERA Modifier: modify time tokens, e.g., next modifies year in next year 5 token types: PREFIX, SUFFIX, LINKAGE, COMMA, IN_ARTICLE Numeral: ordinals and numbers, e.g., 10 in next 10 years 1 token type: NUMERAL Token types to tokens is like POS tags to words POS tags: next/jj 10/CD years/nns Token types: next/prefix 10/NUMERAL years/time_unit

Time Expression Recognition - SynTime General heuristic rules Only relevant to token types Independent of specific tokens

SynTime Layout Rule level General Heuristic Rules Type level Time Token, Modifier, Numeral Token level 1989, February, 12:55, this year, 3 months ago,... Token level: time-related tokens and token regular expressions Type level: token types group the tokens and token regular expressions Rule level: heuristic rules work on token types and are independent of specific tokens

SynTime Overview in practice Add keywords under defined token types and do not change any rules Identify time tokens Import token regex to time token, modifier, numeral Identify modifiers and numerals by expanding the time tokens boundaries Extract time expressions

An example: the third quarter of 1984 A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens surroundings Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens surroundings Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens surroundings Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens surroundings Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 Identify modifiers and numerals by searching time tokens surroundings Identify time tokens Heuristic Rules Assign tokens with token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR A sequence of tokens: the third quarter of 1984

An example: the third quarter of 1984 A sequence of token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR

An example: the third quarter of 1984 A sequence of token types PREFIX NUMERAL TIME_UNIT PREFIX YEAR Export a sequence of tokens as time expression the third quarter of 1984

An example: the third quarter of 1984 Time expression: the third quarter of 1984

Time Expression Recognition - Experiments SynTime SynTime-I: Initial version SynTime-E: Expanded version, adding keywords to SynTime-I (Add keywords under the defined token types and do not change any rules.) Baseline methods HeidelTime: rule-based method SUTime: rule-based method UWTime: learning-based method Experiment datasets TimeBank: comprehensive data in formal text WikiWars: specific domain data in formal text Tweets: comprehensive data in informal text

Overall performance. The best results are in boldface and the second best are underlined. Some results are borrowed from their original papers and the papers indicated by the references. Dataset TimeBank WikiWars Tweets Methods Strict Match Relexed Match Pr. Re. F1 Pr. Re. F1 HeidelTime(Strotgen et al., 2013) 83.85 78.99 81.34 93.08 87.68 90.30 SUTime(Chang and Manning, 2013) 78.72 80.43 79.57 89.36 91.30 90.32 UWTime(Lee et al., 2014) 86.10 80.40 83.10 94.60 88.40 91.40 SynTime-I 91.43 92.75 92.09 94.29 95.65 94.96 SynTime-E 91.49 93.48 92.47 93.62 95.65 94.62 HeidelTime(Lee et al., 2014) 85.20 79.30 82.10 92.60 86.20 89.30 SUTime 78.61 76.69 76.64 95.74 89.57 92.55 UWTime(Lee et al., 2014) 87.70 78.80 83.00 97.60 87.60 92.30 SynTime-I 80.00 80.22 80.11 92.16 92.41 92.29 SynTime-E 79.18 83.47 81.27 90.49 95.39 92.88 HeidelTime 89.58 72.88 80.37 95.83 77.97 85.98 SUTime 76.03 77.97 76.99 88.43 90.68 89.54 UWTime 88.54 72.03 79.44 96.88 78.81 86.92 SynTime-I 89.52 94.07 91.74 93.55 98.31 95.87 SynTime-E 89.20 94.49 91.77 93.20 98.78 95.88

Difference from other Rule-based Methods Method SynTime Other rule-based methods Layout Rule level Type level Token level General Heuristic Rules Time Token, Modifier, Numeral 1989, February, 12:55, this year, 3 months ago,... Rule level Token level Deterministic Rules 1989, February, 12:55, this year, 3 months ago,... Property Heuristic rules work on token types and are independent of specific tokens, thus they are independent of specific domains and specific text types and specific languages. Deterministic rules directly work on tokens and phrases in a fixed manner, thus the taggers lack flexibility Example Heuristic Rules PREFIX NUMERAL TIME_UNIT PREFIX YEAR /the/? [{tag:jj}]? ($NUM_ORD) /-/? [{tag:jj}]? /quarter/ the third quarter of 1984

A simple idea Rules can be designed with generality and heuristics Heuristic Rules PREFIX NUMERAL TIME_UNIT PREFIX YEAR the third quarter of 1984