NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags


NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

Hadi Amiri 1, Yang Bao 2, Anqi Cui 3,*, Anindya Datta 2, Fang Fang 2, Xiaoying Xu 2

1 Department of Computer Science, School of Computing, National University of Singapore, Singapore 117543. E-mail: hadi@nus.edu.sg
2 Department of Information Systems, School of Computing, National University of Singapore, Singapore 117543. E-mails: {baoyang, datta, fangfang, xu1987}@comp.nus.edu.sg
3 State Key Lab of Intelligent Technology & Systems, Tsinghua National Lab for Information Science & Technology, Dept. of Computer Science & Technology, Tsinghua University, Beijing, 100084, China. E-mail: cuianqi@gmail.com

All authors contributed equally to this work and are listed in alphabetical order of their surnames.
* This work was supported by the Tsinghua-NUS NExT Search Center.

Abstract: In this paper, we describe our submission to the TREC 2011 Microblog track. We first use URLs as a clue to discover and remove spam tweets. Then we use both Lucene and Indri to generate a ranked list of results for each query, together with their relevance scores. After that, we use the scores to find hashtags relevant to the query, so that some previously lower-ranked tweets can be discovered and re-ranked higher. Query reformulation is applied in two of the four runs in our submission.

1 Introduction

The NUSIS team participated in the Realtime Adhoc task of the TREC 2011 Microblog track. The team comprises members from the National University of Singapore and Tsinghua University. We analyzed the task and evaluation methods carefully, and submitted results based on our best understanding. The task required at least one run that involved no external or future evidence. To achieve this, we first filtered out tweets posted after the timestamp of each of the 50 query topics, then ran our algorithm once per topic to generate results for all 50 topics.

We now give a broad overview of our approach. We start by removing spam tweets. Tweets are short texts; hence most spam tweets contain URLs pointing to the spammers' own websites, so that users will visit those sites through the links. The presence of a URL in a tweet therefore provides strong evidence of it being spam. We find that many popular URLs are in fact spam URLs [1], and apply a simple method to remove such tweets.

We constructed indices with Lucene [2] and Indri (for query reformulation) [3], obtaining a relevance score for each tweet in the process. Instead of directly submitting the first 30 results, we apply an algorithm that generates modified scores for these tweets, which in turn re-ranks the current list. Since some tweets may be ranked higher after this modification, the submitted results can differ from the original ranking. In the end, the tweets are sorted by these refined scores (for the adhoc evaluation) or by a combination of their timestamps and modified scores (for the balanced evaluation).

The basic idea is to discover relevant hashtags for a query topic. We believe that, when results are generated with traditional text information, the hashtags mentioned in these tweets are also of interest. Therefore, other tweets that contain these hashtags but have less relevant texts are also relevant to the topic. In this way, more tweets of interest are discovered.

2 Dataset and Preprocessing

The tweet dataset is downloaded using the twitter-corpus-download-tool [4]. A total of 13,660,436 tweets were downloaded with status code 200. Tweets with status code 302 (1,093,549 tweets), as well as tweets with code 200 whose text starts with "RT", are retweets, which are defined as non-relevant and are thus removed. Some other tweets (1,378,120 tweets) had been deleted (code 404) or protected (code 403), preventing us from accessing them.

According to the task description, all non-English tweets are considered non-relevant. Similar to the preprocessing step in previous work [5], we remove these tweets based on the characters in their text, i.e., tweets containing characters outside Basic Latin and common symbols are removed. Tweets that are too short carry little information, so tweets with fewer than five words are discarded.

As mentioned above, tweets with spam URLs are removed. Spam URLs are determined in two ways: (1) each URL u that appears more than five times in the corpus is considered a spam URL if n_u / N_u <= 0.4, where n_u is the number of unique users who have posted tweets containing u, and N_u is the total number of times u has appeared; (2) we also examined the dataset and found some popular domains with high occurrence counts, and manually identified a set of spam domains including tinychat, twittascope, twitcam, and twitcast. All URLs in these spam domains are considered spam URLs.

3 Indexing and Query Reformulation

The indices and relevance scores are generated by the Lucene [2] and Indri [3] retrieval systems. We use the default implementation of the popular vector-space model in Lucene.
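The filtering rules of Section 2 can be sketched as follows. This is a simplified illustration rather than the code we ran: the `(user, text)` tuple layout, the URL regular expression, and the ASCII test standing in for the Basic Latin check are our own assumptions, and the status-code filtering is omitted.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

SPAM_DOMAINS = {"tinychat", "twittascope", "twitcam", "twitcast"}
URL_RE = re.compile(r"https?://\S+")

def spam_urls(tweets, min_count=5, ratio=0.4):
    """Find URLs that appear more than min_count times but are posted
    by few unique users, i.e. n_u / N_u <= ratio."""
    users = defaultdict(set)   # URL -> unique posting users (n_u)
    counts = defaultdict(int)  # URL -> total occurrences (N_u)
    for user, text in tweets:
        for url in URL_RE.findall(text):
            users[url].add(user)
            counts[url] += 1
    return {u for u in counts
            if counts[u] > min_count and len(users[u]) / counts[u] <= ratio}

def keep(text, spam):
    """Apply the Section 2 filters: drop retweets, non-Basic-Latin text,
    tweets with fewer than five words, and tweets with spam URLs/domains."""
    if text.startswith("RT"):
        return False
    if not all(ord(c) < 128 for c in text):  # crude Basic Latin check
        return False
    if len(text.split()) < 5:
        return False
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname or ""
        if url in spam or any(d in host for d in SPAM_DOMAINS):
            return False
    return True
```

For example, `keep("watch me live now http://twitcam.com/abc", set())` is rejected by the spam-domain rule, while a plain five-word English tweet passes.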
Indri provides a robust query language that accepts both keyword queries and complex queries. The language supports features such as complex phrase matching, weighted expressions, and Boolean filtering. We utilize these features to formulate the queries for the task. In particular, for each query we first extract its N-grams (N = 2, 3). We then construct weighted expressions from the query and its N-grams; we use weighted expressions because they allow us to control the impact of each expression by varying its weight. Let the query be q = t_1, t_2, ..., t_n, where t_i denotes the i-th term of the query. Using Indri's query language, we construct the following sub-queries: (1) the whole query, with weight 2.0; (2) its 3-grams, with weight 1.5; (3) its 2-grams, with weight 1.0; and (4) all the individual query terms, with weight 0.5.
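This construction can be sketched as a small string builder (a minimal illustration assuming simple whitespace tokenization; the function name `reformulate` and its defaults are our own, with M = 30 and N = 5 as used in our experiments):

```python
def ngrams(terms, n):
    """All contiguous n-grams of a term list."""
    return [terms[i:i + n] for i in range(len(terms) - n + 1)]

def reformulate(query, m=30, n=5):
    """Build the weighted Indri query from Section 3: the whole query as
    an ordered window #M, its 3-grams and 2-grams as ordered windows #N,
    and the bare terms, with weights 2.0 / 1.5 / 1.0 / 0.5."""
    terms = query.split()  # assumption: whitespace tokenization
    parts = ["2.0 #%d(%s)" % (m, " ".join(terms))]
    parts += ["1.5 #%d(%s)" % (n, " ".join(g)) for g in ngrams(terms, 3)]
    parts += ["1.0 #%d(%s)" % (n, " ".join(g)) for g in ngrams(terms, 2)]
    parts += ["0.5 %s" % t for t in terms]
    return "#weight( " + " ".join(parts) + " )"
```

For instance, `reformulate("bbc world service cuts")` yields one `#30` window over the full query, two 3-gram and three 2-gram `#5` windows, and four single terms.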

The following query is then obtained:

#weight( 2.0 #M(t_1 t_2 ... t_n)
         1.5 #N(t_1 t_2 t_3) 1.5 #N(t_2 t_3 t_4) ... 1.5 #N(t_{n-2} t_{n-1} t_n)
         1.0 #N(t_1 t_2) 1.0 #N(t_2 t_3) ... 1.0 #N(t_{n-1} t_n)
         0.5 t_1 0.5 t_2 ... 0.5 t_n )

where #X indicates that the terms must appear in order within a maximum distance of X-1. We set M and N to 30 and 5, respectively, in our experiments, and then perform retrieval using these reformulated queries.

4 Refining with Hashtags

A hashtag (a word starting with the hash symbol #) is a feature specific to tweets. Hashtags usually denote the category or topic of a tweet, providing a useful annotation. Since hashtags are supplied by the original author of the tweet, we believe they are highly relevant to its content and can thus be used to improve the query results.

Formally, let the unrefined results be {r_i}, where each r_i consists of a tweet t_i and its relevance score s_i, and let {h_{i,m}} be the hashtags in t_i. Since the first 30 results matter most in the evaluation, we scan the first 30 tweets to assign scores to their hashtags: each hashtag h_{i,m} in one of these tweets is assigned the score s_i. Hashtags that appear earlier in the result list (i.e., in higher-ranked tweets) therefore receive higher scores. Note that the same hashtag may occur more than once across the tweets (or even within the tweet t_i itself); such a hashtag is assigned a score multiple times. Then, for each unique hashtag h_k, we add up all the scores it has been assigned, so hashtags that occur more often obtain a higher final score S(h_k).

The final score of a tweet combines its original relevance score s_i with the scores of the hashtags it contains, Σ_k S(h_{i,k}), controlled by two weight factors w_1 and w_2:

finalscore(t_i) = w_1 · s_i + (1 − w_1) · Σ_k S(h_{i,k}) + w_2 · s_i · Σ_k S(h_{i,k})

In practice, we set w_1 = 0.85 and w_2 = 0.07. Finally, all retrieved tweets (usually more than 30) are re-ranked by this final score.
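A minimal sketch of this hashtag refinement (assuming each result is a `(tweet_id, hashtags, score)` triple already sorted by relevance; this tuple layout is our illustration, not the paper's data format):

```python
from collections import defaultdict

def rerank(results, w1=0.85, w2=0.07, top=30):
    """Re-rank (tweet_id, hashtags, score) results by hashtag evidence,
    following Section 4: hashtag scores S(h) are accumulated from the
    top-`top` tweets only; `results` must be sorted by relevance."""
    # Each occurrence of hashtag h in a top tweet i adds s_i to S(h).
    S = defaultdict(float)
    for _, hashtags, s in results[:top]:
        for h in hashtags:
            S[h] += s

    def final_score(item):
        _, hashtags, s = item
        hsum = sum(S[h] for h in hashtags)
        return w1 * s + (1 - w1) * hsum + w2 * s * hsum

    return sorted(results, key=final_score, reverse=True)
```

The defaults w1 = 0.85 and w2 = 0.07 follow the values given above; a tweet sharing hashtags with several high-ranked tweets can overtake a tweet with a slightly higher text-only score but no supporting hashtags.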
The first 30 results are then submitted for the adhoc evaluation. For the balanced evaluation, we consider both the time a tweet was created and its refined final score: we sort the result list by time and by score separately, giving each tweet two ranks, then add the two ranks to obtain a new index for the tweet and sort the list by this index. The first 30 results from this newly sorted list are submitted.

5 Results and Discussion

According to the judgments, the evaluation includes scores from all 49 topics (topic 50 was dropped) and from the 33 topics that have highly relevant tweets. The primary measure is P@30. We submitted four runs, two with Lucene indexing and two with Indri (query reformulation); for each indexing type, we submitted a relevance result and a balanced result. Our performance is shown in Fig. 1 to Fig. 3. The solid lines ("balance" and "relevance") are generated from Lucene, while the dashed lines ("refbal" and "refrel") are from Indri with query reformulation. We compare the results of these four runs with the median performance of the participants, together with the baseline provided by TREC.

Fig. 1 Mean average precision on each query topic

Fig. 2 R-Precision on each query topic

Fig. 3 Relevant Retr @ 30 on each query topic

From the figures we can see clearly that our runs outperform the median performance on most topics.

Note that the baseline is based on Lucene without any further post-processing. Comparing it with the relevance line (Lucene plus post-processing), we find that our post-processing methods improve performance most of the time. In the cases where the baseline is better, the difference may be due to noise in the hashtags. Another finding is that ranking by relevance performs better than the balanced ranking. Although the task announced that the evaluation would cover two aspects, we find that the provided judgment (the official evaluation) takes the tweet IDs (which represent the times the tweets were posted), retrieved in descending order, as the rank order of the run. Under this interpretation, our strategy could have been designed differently to achieve better performance. As for the comparison between runs with and without query reformulation, there is no significant difference in general.

6 Conclusions

In this paper, we describe our participation in the TREC 2011 Microblog track. We design specific methods in both preprocessing and post-processing to fully utilize features specific to tweets, namely spam URLs and hashtags. Although the query model is simple, we find that these methods are helpful for discovering relevant tweets. In future work, we expect to examine the provided judgments in detail to identify the features that make a tweet relevant. Although tweet texts are short, there should be specific characteristics that are helpful for tweet retrieval.

Acknowledgement

The NExT Search Centre is supported by the Singapore National Research Foundation & Interactive Digital Media R&D Program Office, MDA, under research grant (WBS: R-252-300-001-490).

References

[1] Anqi Cui, Min Zhang, Yiqun Liu, Shaoping Ma. Are the URLs Really Popular in Microblog Messages? Proceedings of the 2011 IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS 2011), Beijing, 2011.
[2] Apache Lucene. http://lucene.apache.org/java/docs/. Visited on 1 Aug 2011.
[3] Donald Metzler, W. Bruce Croft. Combining the Language Model and Inference Network Approaches to Retrieval. Information Processing and Management: Special Issue on Bayesian Networks and Information Retrieval, 40(5): 735-750, 2004.
[4] Jimmy Lin. twitter-corpus-tools. https://github.com/lintool/twitter-corpus-tools. Visited on 1 Aug 2011.
[5] Anqi Cui, Min Zhang, Yiqun Liu, Shaoping Ma. Emotion Tokens: Bridging the Gap Among Multilingual Twitter Sentiment Analysis. Proceedings of the 7th Asia Information Retrieval Societies Conference (AIRS 2011), Dubai, 2011.