MPI-INF AT THE NTCIR-11 TEMPORAL QUERY CLASSIFICATION TASK

Similar documents
MPI-INF at the NTCIR-11 Temporal Query Classification Task

NTCIR-11 Temporal Information Access Task

NLP Final Project Fall 2015, Due Friday, December 18

Exam Marco Kuhlmann. This exam consists of three parts:

Time-aware Approaches to Information Retrieval

Temporal Information Access (Temporalia-2)

A Hybrid Neural Model for Type Classification of Entity Mentions

NTCIR-11 Temporalia. Temporal Information Access

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)"

Annotating Spatio-Temporal Information in Documents

Using machine learning to predict temporal orientation of search engines queries in the Temporalia challenge

Temporal Query Classification at Different Granularities

Question Answering Systems

On the automatic classification of app reviews

The Perceptron. Simon Šuster, University of Groningen. Course Learning from data November 18, 2013

Understanding User s Search Behavior towards Spiky Events

Summer 2010 Research Project. Spam Filtering by Text Classification. Manoj Reddy Advisor: Dr. Behrang Mohit

Hands on Datamining & Machine Learning with Weka

LIDER Survey. Overview. Number of participants: 24. Participant profile (organisation type, industry sector) Relevant use-cases

CE807 Lab 3 Text classification with Python

On the Automatic Classification of App Reviews

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

ResPubliQA 2010

for Searching Social Media Posts

Domain Based Named Entity Recognition using Naive Bayes

FLL: Answering World History Exams by Utilizing Search Results and Virtual Examples

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

KGO at the NTCIR-12 Temporalia Task

Link Prediction for Social Network

Predictive Analysis: Evaluation and Experimentation. Heejun Kim

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information

Classification. I don t like spam. Spam, Spam, Spam. Information Retrieval

Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun

Document Structure Analysis in Associative Patent Retrieval

Automatic Labeling of Issues on Github A Machine learning Approach

Using machine learning to predict temporal orientation of search engines queries in the Temporalia challenge

NLP Lab Session Week 9, October 28, 2015 Classification and Feature Sets in the NLTK, Part 1. Getting Started

6.034 Design Assignment 2

Entity-centric Topic Extraction and Exploration: A Network-based Approach

Final Project Discussion. Adam Meyers Montclair State University

A Textual Entailment System using Web based Machine Translation System

Customer Clustering using RFM analysis

Vision Plan. For KDD- Service based Numerical Entity Searcher (KSNES) Version 2.0

A CASE STUDY: Structure learning for Part-of-Speech Tagging. Danilo Croce WMR 2011/2012

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

##SQLSatMadrid. Project [Vélib by Cortana]

Dynamic Feature Selection for Dependency Parsing

Conceptual document indexing using a large scale semantic dictionary providing a concept hierarchy

NTCIR Temporalia: A Test Collection for Temporal Information Access Research

Query Intent Detection using Convolutional Neural Networks

Automatic Summarization

Query Difficulty Prediction for Contextual Image Retrieval

Overview of NTCIR-11 Temporal Information Access (Temporalia) Task

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Data Mining Algorithms: Basic Methods

Domain Adaptation Using Domain Similarity- and Domain Complexity-based Instance Selection for Cross-domain Sentiment Analysis

In this project, I examined methods to classify a corpus of s by their content in order to suggest text blocks for semi-automatic replies.

Estimating Embedding Vectors for Queries

Detection and Extraction of Events from s

slide courtesy of D. Yarowsky Splitting Words a.k.a. Word Sense Disambiguation Intro to NLP - J. Eisner 1

Towards Summarizing the Web of Entities

IN4325 Query refinement. Claudia Hauff (WIS, TU Delft)

A Practical Passage-based Approach for Chinese Document Retrieval

7 Techniques for Data Dimensionality Reduction

Federated Search. Jaime Arguello INLS 509: Information Retrieval November 21, Thursday, November 17, 16

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Lecture 10 May 14, Prabhakar Raghavan

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Fast and Effective System for Name Entity Recognition on Big Data

Advanced Topics in Information Retrieval. Learning to Rank. ATIR July 14, 2016

CSEP 517 Natural Language Processing Autumn 2013

ICRCS at Intent2: Applying Rough Set and Semantic Relevance for Subtopic Mining

Influence of Word Normalization on Text Classification

Apache UIMA and Mayo ctakes

Personalized Web Search

Graph-Based Relation Validation Method

Precise Medication Extraction using Agile Text Mining

Query Intent Detection using Convolutional Neural Networks

Sentiment analysis under temporal shift

AI32 Guide to Weka. Andrew Roberts 1st March 2005

BaggTaming Learning from Wild and Tame Data

Watson & WMR2017. (slides mostly derived from Jim Hendler and Simon Ellis, Rensselaer Polytechnic Institute, or from IBM itself)

INF5820/INF9820 LANGUAGE TECHNOLOGICAL APPLICATIONS. Jan Tore Lønning, Lecture 8, 12 Oct

Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task

Topics du jour CS347. Centroid/NN. Example

Gaussian Processes for Robotics. McGill COMP 765 Oct 24 th, 2017

Temporal Modeling and Missing Data Estimation for MODIS Vegetation data

Density estimation. In density estimation problems, we are given a random from an unknown density. Our objective is to estimate

Ranked Retrieval. Evaluation in IR. One option is to average the precision scores at discrete. points on the ROC curve But which points?

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22

Domain-specific Concept-based Information Retrieval System

Wikulu: Information Management in Wikis Enhanced by Language Technologies

Refresher on Dependency Syntax and the Nivre Algorithm

Stanbol Enhancer. Use Custom Vocabularies with the. Rupert Westenthaler, Salzburg Research, Austria. 07.

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Hidden Markov Model for Sequential Data

Multi-Document Summarizer for Earthquake News Written in Myanmar Language

CS 188: Artificial Intelligence Fall Machine Learning

Transcription:

MPI-INF AT THE NTCIR-11 TEMPORAL QUERY CLASSIFICATION TASK Robin Burghartz Klaus Berberich Max Planck Institute for Informatics, Saarbrücken, Germany

General Approach Overall strategy for TQIC subtask: 1. Focus on deriving features for classification 2. Rely on established off-the-shelf components to test a wide spectrum of different features

General Approach Components: WEKA : for classification StanfordCoreNLP: for the various NLP processing needs of the task Hadoop/MapReduce: for the construction of a temporal dictionary

Overview 1. Feature Design 2. Results 3. Conclusion

Overview 1. Feature Design 2. Results 3. Conclusion

Feature Design - Overview Different classes of features: Collection features analyze temporal expressions and timestamps from the collection s documents Linguistic features consider linguistic properties of the query string Trigger word features allow the classifier to learn very clear time determiners (e.g. forecast, which hints to a future intent) Semantic feature(s) capture(s) the topic of a query

Collection Features

Collection Features (1) Distribution of temporal expressions What time do temporal expressions from pseudo-relevant documents refer to? => Content time

Collection Features (1) Distribution of temporal expressions Content time Example: Two principles: If temporal expression is imprecise, distribute weight Temporal expressions from more relevant documents get bigger weight in the distribution d1: 2014 d2: January 2014 Jan/01 Dec/31

Collection Features (1) Distribution of temporal expressions Content time distribution, i.e. distribution of temporal expressions from relevant documents Example Features: How much weight is before/on/after the query issue time? Where are the quantiles relative to the query issue time?

Collection Features (1) Distribution of temporal expressions Content time R: pseudo-relevant documents TE d : set of temporal expressions from document d

Collection Features (2) Distribution of timestamps When were relevant documents published? => Publication time

Collection Features (2) Distribution of timestamps Publication time: This approach is close to the approach used in Temporal profiles of queries [Jones et Diaz] (similar classification task, but focus on the distinction between atemporal/temporal).

Collection Features (2) Distribution of timestamps Publication time distribution, i.e. distribution of the documents timestamps

Collection Features (2) Distribution of timestamps Publication time distribution, i.e. distribution of the documents timestamps Example Features: How peaky is the curve? (kurtosis) How much does the distribution deviate from the background distribution? (temporal KL divergence)

Collection Features (3) Temporal Dictionary Iterate over the complete document collection and for each unigram/bigram x create a dictionary entry containing the following scores: dictpastscore dictrecencyscore dictfuturescore avgtimex fraction of sentences with x that contain a temporal expression which refers to the past/present/future (relative to document timestamp) average number of temporal expressions that appear together with x

Collection Features (3) Temporal Dictionary Example: live stream moon landing dictpastscore 0.08 0.28 dictrecencyscore 0.20 0.08 dictfuturescore 0.07 0.09 avgtimex 0.49 0.75 Given a query, we can now determine average values of the query terms as features! (Simple dictionary lookup)

Linguistic Features (any features obtained from NLP taggers)

Linguistic Features (1) POS/Named Entity Features (Examples) startswithnoun: time in london startswithunpersonalverb: lose weight quickly startswithinflectedverb: did the Pirates win today containsentity: nba playoffs 2013 standings No obvious relation to temporal class!

Linguistic Features (2) Tense Features containspasttense: what was the cold war containspresenttense: time is of the essence containsfuturetense: when will the sun rise tomorrow

Linguistic Features (3) Temporal expressions in the query string containspastdate: fifa world cup 2006 containsfuturedate: fifa world cup 2018 Different features for temporal expressions which refer to a recent date

Overview 1. Feature Design 2. Results 3. Conclusion

Setup We train two classifiers on our training set (using the previously designed features): Naive Bayes classifier (WEKA implementation NaiveBayes) Decision tree (WEKA implementation J48) We use these classifiers to assign classes to unseen instances.

Setup We use two baselines: Random classifier : 25 % Unigram/bigram feature classifier : 56.30 % We can correctly predict the class of more than half of our instances only based on unigrams and bigrams!

Results Accuracy values from the formal runs (in %): total P R F A Run 1: 62.33 53 57 65 73 Run 2: 64.00 60 49 71 76 Run 3: 61.67 60 44 63 80 Our approach yields a relative improvement of ~10% over the baseline.

Results Insights: Temporal dictionary: we can correctly classify over 50% of our queries with only three features! Linguistic features: exploiting linguistic query properties could be useful (on queries with certain properties) Maybe not so effective in the real world

Overview 1. Feature Design 2. Results 3. Conclusion

Conclusion This presentation gave a concise overview over some features that we designed Collection features: time distributions, temporal dictionary Linguistic features: POS, NER, tense, temporal expressions summarized the experimental results Accuracy estimates from formal runs Observations

Thank you for your attention!

References Temporal Query Classification, Bachelor s Thesis, Robin Burghartz, 2014.