Mining the Temporal Statistics of Query Terms for Searching Social Media Posts


Mining the Temporal Statistics of Query Terms for Searching Social Media Posts
ICTIR '17, Amsterdam, Oct. 1st, 2017
Jinfeng Rao, Ferhan Ture, Xing Niu, Jimmy Lin

Task: Ad-hoc Search in the Social Media Domain
Input: a stream of tweets and interest profiles (the user's queries, roughly topics).
Output: a ranked list of tweets.
Example query MB001: BBC world service staff cut....

Background: Challenges for Social Media Search
Posts are usually very short: at most 140 characters for tweets.
Posts are written in a highly concise way and can be quite noisy: many abbreviations, misspellings, typos, emojis, hashtags, etc.
Time is an important relevance signal: relevant posts tend to cluster around the time breaking news happened.
Example: query MB001 from TREC 2011, "BBC World Service staff cuts".
Figure: distribution of relevant documents (ground truth). The x-axis denotes the number of days prior to query time; the height of each bar denotes the number of relevant documents in that time interval.

Combine Lexical and Temporal Evidence
Moving window, Dakka et al., TKDE '12 [2] (pseudo trend).
Kernel density estimation, Efron et al., SIGIR '14 [3]:
$$\hat{f}_{\omega}(x) = \frac{1}{nh} \sum_{i=0}^{n} \omega_i \, K\!\left(\frac{x - x_i}{h}\right)$$
Recurrent neural networks, Rao et al., Neu-IR '17 [4].
However, all of these approaches require two-stage retrieval:
Initial retrieval: estimate the ground truth distribution (pseudo trend).
Second retrieval: rerank documents with the estimated pseudo trend.
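
A minimal sketch of a weighted kernel density estimate over document timestamps, as used to build a pseudo trend from an initial retrieval. The Gaussian kernel, the function names, and the sample values are assumptions for illustration; they are not taken from Efron et al. [3].

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def weighted_kde(x, points, weights, bandwidth):
    """Weighted density estimate at time x.

    points    -- timestamps x_i of the initially retrieved documents
    weights   -- one weight per document (uniform, score-based, or rank-based)
    bandwidth -- smoothing parameter h
    """
    n = len(points)
    total = sum(
        w * gaussian_kernel((x - xi) / bandwidth)
        for xi, w in zip(points, weights)
    )
    return total / (n * bandwidth)


# Hypothetical example: three documents about one day before the query, one much older.
days_before_query = [1.0, 1.5, 2.0, 10.0]
uniform_weights = [1.0, 1.0, 1.0, 1.0]
density_near_burst = weighted_kde(1.5, days_before_query, uniform_weights, bandwidth=1.0)
density_far_away = weighted_kde(6.0, days_before_query, uniform_weights, bandwidth=1.0)
```

The estimate is higher near the cluster of retrieved documents than in the quiet period, which is what lets the second retrieval stage favor documents posted around the burst.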

Research Question
Can we use the temporal statistics of query terms (term trends) to predict the ground truth distribution?
What is a term trend? The frequency of a term in the collection, counted over 5-minute intervals.
Example: ground truth and term trends for query MB127, "hagel nomination filibustered", from the TREC 2013 topic set.
Figure: ground truth and term trends; the two are strongly correlated.
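
A minimal sketch of how a term trend could be computed, assuming posts are available as (timestamp in seconds, text) pairs and that a term's frequency is its number of whitespace-separated occurrences. The function name, the tokenization, and the sample posts are hypothetical.

```python
BUCKET_SECONDS = 5 * 60  # 5-minute intervals, as on the slide


def term_trend(posts, term, start, end):
    """Count occurrences of `term` in each 5-minute bucket of [start, end).

    posts -- iterable of (timestamp_seconds, text) pairs
    term  -- a single lowercase unigram
    """
    n_buckets = (end - start + BUCKET_SECONDS - 1) // BUCKET_SECONDS
    counts = [0] * n_buckets
    for timestamp, text in posts:
        if start <= timestamp < end:
            bucket = (timestamp - start) // BUCKET_SECONDS
            counts[bucket] += text.lower().split().count(term)
    return counts


# Hypothetical posts within a 15-minute window.
posts = [
    (0, "hagel nomination"),
    (100, "hagel hagel"),
    (300, "filibustered"),
    (650, "Hagel vote"),
]
trend = term_trend(posts, "hagel", start=0, end=900)
```

Because the counts depend only on collection statistics, a trend like this can be computed without running an initial retrieval.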

Approach: Temporal Modeling via Regression
Figure: ground truth and term trends.
Goal: approximate the ground truth ($Y$) by a weighted sum of all term trends ($f_t$).

Term Importance Modeling
Bursty terms can be more informative.
We adopt the definition of entropy to measure the importance of terms, computed from the counts $\{c_1, c_2, \ldots, c_n\}$ of a particular term $t$ (unigram or bigram).
Lower entropy = bursty term trend = more important.
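
A minimal sketch of the entropy of a term trend, assuming the usual Shannon entropy over the counts normalized to a probability distribution. The exact definition on the original slide is not preserved in this transcript, so the normalization and the natural logarithm are assumptions.

```python
import math


def trend_entropy(counts):
    """Shannon entropy (in nats) of a term's counts across time intervals.

    Assumes p_i = c_i / sum(c); intervals with zero count contribute nothing.
    """
    total = sum(counts)
    if total == 0:
        # Term never occurs; returning 0.0 is an arbitrary choice for this sketch.
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


bursty = trend_entropy([0, 0, 8, 0])  # all mass in one interval
flat = trend_entropy([2, 2, 2, 2])    # spread evenly over four intervals
```

A term whose occurrences are concentrated in one interval has zero entropy, while an evenly spread term reaches the maximum, log(n).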

Approach: Temporal Modeling via Regression
Two questions arise in this non-linear regression model:
Q1: How should the weights of different query terms be modeled?
Q2: How should the contribution of unigrams be differentiated from that of bigrams?
Q1 solution: an exponential mapping from entropy to term weight.
Q2 solution: let the unigram weight be $u_i$ and the bigram weight $(1 - u_i)$, where $u_i$ is a function of $R_i$, the difference between the maximum unigram entropy and the maximum bigram entropy.
Intuition: $R_i > 0 \Rightarrow \max(\text{unigram entropy}) > \max(\text{bigram entropy}) \Rightarrow u_i > 0.5$.
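
The exact formulas for the two solutions are not preserved in this transcript. The sketch below shows one possible parameterization that satisfies the stated properties: an exponential map in which lower entropy gives a larger weight, and a logistic function of $R_i$ that exceeds 0.5 exactly when $R_i > 0$. The functional forms, the parameters `a` and `b`, and all names are assumptions, not the authors' definitions.

```python
import math


def entropy_to_weight(entropy, a=1.0):
    """Hypothetical exponential mapping: lower entropy gives a larger weight."""
    return math.exp(-a * entropy)


def unigram_share(r, b=1.0):
    """Hypothetical logistic map of R_i into (0, 1); equals 0.5 at R_i = 0."""
    return 1.0 / (1.0 + math.exp(-b * r))


def term_weights(unigram_entropies, bigram_entropies, a=1.0, b=1.0):
    """Return (u, weights) for one query.

    unigram_entropies / bigram_entropies -- dicts mapping term -> entropy
    """
    # R_i: maximum unigram entropy minus maximum bigram entropy.
    r = max(unigram_entropies.values()) - max(bigram_entropies.values())
    u = unigram_share(r, b)
    weights = {}
    for term, h in unigram_entropies.items():
        weights[term] = u * entropy_to_weight(h, a)
    for term, h in bigram_entropies.items():
        weights[term] = (1.0 - u) * entropy_to_weight(h, a)
    return u, weights


# Hypothetical entropies for a two-word query.
u, weights = term_weights(
    {"hagel": 2.0, "nomination": 3.0},
    {"hagel nomination": 1.5},
)
```

In a learned model, `a` and `b` would be fitted rather than fixed; they are left as plain arguments here to keep the sketch short.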

Approach: Temporal Modeling via Regression
The reformulated problem has an objective loss that can be minimized with gradient descent (more details in the paper).
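
A minimal sketch of fitting a weighted sum of term trends to the ground truth with gradient descent. It assumes a mean squared error loss and one free weight per term, which is simpler than the non-linear, entropy-based parameterization described in the talk; the function name, learning rate, and data are hypothetical.

```python
def fit_trend_weights(trends, target, lr=0.1, steps=500):
    """Fit weights w so that sum_t w[t] * trends[t][i] approximates target[i].

    trends -- list of term trends, each a list of per-interval counts
    target -- ground truth distribution over the same intervals
    Minimizes mean squared error (an assumed loss) by batch gradient descent.
    """
    n_terms = len(trends)
    n_bins = len(target)
    weights = [0.0] * n_terms
    for _ in range(steps):
        # Residuals are computed once per step, before any weight is updated.
        residuals = [
            sum(weights[t] * trends[t][i] for t in range(n_terms)) - target[i]
            for i in range(n_bins)
        ]
        for t in range(n_terms):
            grad = 2.0 / n_bins * sum(
                residuals[i] * trends[t][i] for i in range(n_bins)
            )
            weights[t] -= lr * grad
    return weights


# Hypothetical data: the target is exactly 0.5 * trend_0 + 2.0 * trend_1.
trends = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
target = [0.5, 2.0, 1.0, 2.0]
fitted = fit_trend_weights(trends, target)
```

On this toy input the fitted weights converge to the values used to generate the target, which is a quick check that the update rule has the right sign and scale.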

Combine Term Trend with Pseudo Trend
Two ways to estimate the ground truth distribution:
Document-level: pseudo trend through an initial retrieval.
Term-level: regression over term trends.
The term trend and the pseudo trend are combined in a linear ranking model.
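
A minimal sketch of a linear ranking model that combines a lexical score with the two temporal estimates. The feature names (`ql`, `pseudo`, `term`), the weights, and the sample values are hypothetical; the actual feature set and learned weights are not given in this transcript.

```python
def rank_documents(docs, w_lex, w_pseudo, w_term):
    """Rank documents by a linear combination of three features.

    docs -- list of dicts with keys:
        "id"     -- document identifier
        "ql"     -- lexical (query likelihood) score
        "pseudo" -- pseudo trend density at the document's timestamp
        "term"   -- term trend estimate at the document's timestamp
    """
    scored = [
        (w_lex * d["ql"] + w_pseudo * d["pseudo"] + w_term * d["term"], d["id"])
        for d in docs
    ]
    # Highest combined score first; sorted() is stable for ties.
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: -pair[0])]


# Hypothetical candidates.
docs = [
    {"id": "t1", "ql": -5.0, "pseudo": 0.10, "term": 0.20},
    {"id": "t2", "ql": -4.8, "pseudo": 0.01, "term": 0.02},
    {"id": "t3", "ql": -6.0, "pseudo": 0.30, "term": 0.40},
]
lexical_only = rank_documents(docs, 1.0, 0.0, 0.0)
with_temporal = rank_documents(docs, 1.0, 2.0, 2.0)
```

With the temporal weights switched on, documents posted during the estimated burst can overtake a document that scores slightly better on lexical evidence alone.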

Experimental Setup
Topic set: TREC Microblog Track 2013 and 2014, 115 topics in total.
Collection: Tweets2013 (~243 million tweets).
Metrics: Mean Average Precision (AP) and Precision at 30 (P30).
Three data splits:
Odd-even: odd-numbered topics (57 topics) for training, even-numbered topics (58 topics) for testing.
Even-odd: the train/test split is switched.
Cross: 4-fold cross-validation.
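
For reference, a minimal sketch of the two per-topic metrics under binary relevance. Function names and the toy ranking are illustrative; in practice these numbers are normally produced by a standard evaluation tool rather than hand-written code.

```python
def precision_at_k(ranked_ids, relevant_ids, k=30):
    """Fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


# Toy example: relevant documents at ranks 1 and 3, one relevant document missed.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
```

Mean Average Precision is then the mean of `average_precision` over all test topics, and P30 is the mean of `precision_at_k` with `k=30`.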

Baselines
1. Query Likelihood (QL)
2. Recency Prior, Li and Croft, CIKM '03 [1]
3. Moving Window, Dakka et al., TKDE '12 [2]
4. Kernel Density Estimation (KDE), Efron et al., SIGIR '14 [3]
   Uniform-based weighting (IRDu)
   Score-based weighting (IRDs)
   Rank-based weighting (IRDr)
   Oracle (upper bound)

Main Results
Conclusions:
KDE with rank-based weights (IRDr) is the strongest baseline.
Our approach (Reg-IRDr) significantly outperforms all baselines and is even close to the upper bound in some splits.

Randomized Experiments
Average improvement over the QL baseline, summarized over 30 random train/test splits.

Per-Topic Analysis
Per-topic P30 improvement over Query Likelihood (QL) and the best KDE baseline (IRDr).

Analysis of the Best-Performing Topic (144)
How do term trend signals help?
Figure: red shows the ground truth distribution, green the pseudo trend estimated by the best KDE method (IRDr), and blue the term trends.
Conclusion: combining the pseudo trend (KDE) with the term trend (our approach) gives a more accurate estimate of the ground truth distribution.

Conclusion
We are the first to study the temporal statistics of query terms for social media search.
Our learning-to-rank and regression models show that this new signal is effective.
For efficiency, use our term trend modeling technique.
For effectiveness, use the combination of pseudo trend and term trend modeling.

Thanks for listening! Any questions?

References
1. Xiaoyan Li and W. Bruce Croft. 2003. Time-Based Language Models. In CIKM. 469-475.
2. Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis. 2012. Answering General Time-Sensitive Queries. TKDE.
3. Miles Efron, Jimmy Lin, Jiyin He, and Arjen de Vries. 2014. Temporal Feedback for Tweet Search with Non-Parametric Density Estimation. In SIGIR. 33-42.
4. Jinfeng Rao, Hua He, Haotian Zhang, Ferhan Ture, Royal Sequiera, Salman Mohammed, and Jimmy Lin. 2017. Integrating Lexical and Temporal Signals in Neural Ranking Models for Social Media Search. In SIGIR Workshop on Neural Information Retrieval (Neu-IR).