University of Waterloo: Logistic Regression and Reciprocal Rank Fusion at the Microblog Track

Adam Roegiest and Gordon V. Cormack
David R. Cheriton School of Computer Science, University of Waterloo

1 Introduction

For the second iteration of the Microblog Track, participants were given two tasks. The first was the same ad hoc search task as in the 2011 iteration; its goal was to extend last year's methods to 60 new topics and to explore different evaluation measures. The second task was to filter the corpus with respect to the 2011 topics, simulating a streaming environment and examining how simulated user feedback can affect retrieval results.

For the ad hoc search task, we extended last year's approach by continuing to use the Wumpus search engine (available at http://www.wumpus-search.org) and adding a logistic regression classifier (denoted GCLR in this paper), first used in the TREC 2007 Spam Track [5]. In addition, pseudo-relevance feedback was conducted this year using a swapdocs approach [3], which is described in detail below. As well, a semi-automatic logistic regression run was conducted using seed documents provided by a user. For the filtering task, only different methods of training GCLR were examined; no manual feedback was used.

2 Experimental Setup

2.1 Corpus

From May 27th to June 3rd, the provided HTML crawler (available at github.com/lintool/twitter-corpus-tools) was run, after which the tweets were repaired using the provided repair code. Once we determined that the HTML crawler returned some tweets with null text bodies, the supplied repair code was modified to re-fetch tweets with null bodies. The modified repair script was run six times, until it was no longer feasible to continue trying to repair tweets. Table 1 shows the distribution of tweets in our copy of the corpus.

Table 1: Corpus distribution

  HTTP Status   Num. of Tweets   Num. null text
  200           14,313,888       14
  302            1,137,183       25,571
  403              138,560       138,560
  404              552,181       552,181

We note that status code 200 denotes manually posted tweets, 302 denotes Twitter-supported retweets, 403 denotes tweets that the posting user has protected, and 404 denotes deleted tweets. Any tweet marked as 403 or 404 has no associated information, as it is no longer available for public use. No further attempt was made to retrieve the remaining 25,585 tweets, as we were unsure whether they were still available (i.e. not protected or deleted). In addition, we wished to begin working with the corpus and did not believe the missing tweets would significantly affect our results. However, only those tweets with a 200, 301, or 302 status code had the potential to be returned in any of the runs described in this work.

2.2 Ad Hoc Task Methodology

The basic idea of our methodology for the ad hoc task was to extend our approach from last year with a swapdocs approach to pseudo-relevance feedback. In the swapdocs approach, different systems (in our case, ranking methods over different features, plus logistic regression) score documents with respect to a topic and then exchange their top k results (k=20 in our case) as the basis for pseudo-relevance feedback, instead of relying on some other form of document selection. Thus, the swapdocs approach can be seen as having two phases: an initial seed-document retrieval phase and a pseudo-relevance feedback phase. This results in the creation of $N^2$ result lists (where $N$ is the number of systems) and, as last year, these result lists were combined using reciprocal rank fusion (RRF) [6]. The formula used was

$$\mathrm{RRFscore}(t) = \sum_{i} \frac{1}{60 + r_i(t)}$$

where $r_i(t)$ is the rank of tweet $t$ in result set $i$.
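As a concrete illustration, a minimal sketch of RRF fusion in Python follows. The constant 60 matches the formula above; the function and variable names are our own illustrative choices, not code from the actual system.

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Fuse ranked result lists with reciprocal rank fusion.

    result_lists: iterable of ranked lists of tweet ids (best first).
    Returns tweet ids sorted by descending RRF score.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, tweet_id in enumerate(results, start=1):
            # Each list contributes 1 / (k + rank) for every tweet it ranks.
            scores[tweet_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three systems ranking overlapping sets of tweets.
fused = rrf_fuse([["t1", "t2", "t3"], ["t2", "t1"], ["t3", "t2"]])
print(fused[:3])  # tweets favoured by agreement across systems rise to the top
```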

For the first phase, GCLR (which uses byte 4-grams as features) was trained on a 5:1 bias of the topic statement to random tweets (selected from before the query time), such that there were 100 copies of the topic statement and 20 random tweets. Here, "bias" means that the logistic regression classifier was trained on repeated examples of relevant documents until the desired ratio of relevant to non-relevant examples was achieved. For the second phase, GCLR was trained on the top 20 tweets from a system and 20 random tweets for each topic. In both phases, GCLR classified all documents up to the query time.
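As an illustration of this biased training scheme, here is a minimal sketch using scikit-learn. The actual GCLR is an online logistic-regression classifier from the TREC 2007 Spam Track, so batch LogisticRegression and character (rather than byte) 4-grams are simplifying stand-ins, and the topic text is only an example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Character 4-grams as a stand-in for GCLR's byte 4-grams;
# feature hashing keeps the feature space bounded.
vectorizer = HashingVectorizer(analyzer="char", ngram_range=(4, 4),
                               n_features=2**18)

def train_biased(topic, random_tweets, bias=5):
    """Train with a relevant:non-relevant ratio of bias:1 by repeating
    the relevant example (e.g. 100 topic copies vs. 20 random tweets)."""
    relevant = [topic] * (bias * len(random_tweets))
    texts = relevant + list(random_tweets)
    labels = [1] * len(relevant) + [0] * len(random_tweets)
    clf = LogisticRegression()
    clf.fit(vectorizer.transform(texts), labels)
    return clf

clf = train_biased("BBC World Service staff cuts",
                   ["random tweet one", "random tweet two"])
# Score an unseen tweet: probability of the "relevant" class.
score = clf.predict_proba(vectorizer.transform(["staff cuts at the BBC"]))[:, 1]
```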
Our approach to Wumpus was much the same as last year. Feature engineering was performed for each query by creating six different indexes for Wumpus, each storing a different representation of a tweet's features. Table 2 briefly describes the features used in each index.

Table 2: Description of engineered features

  Index 1  Stemmed and unstemmed versions of the words present in the tweet (stemming via a Porter stemmer), plus word bigrams.
  Index 2  Identical to Index 1, but without word bigrams.
  Index 3  Unstemmed words and word bigrams only.
  Index 4  Identical to Index 3, but without word bigrams.
  Index 5  Character 3-grams instead of words; stemming is not used, as it provides no benefit here.
  Index 6  Identical to Index 5, but with character 4-grams.

Tweets were ranked on each index using the five methods outlined in Table 3.

Table 3: Ranking methods used

  Okapi BM25 [7]  A bag-of-words retrieval method that ranks documents by how often query terms appear in the document; it does not account for relationships between query terms in the document (e.g. proximity). Tuning on the 2011 topics, we used (b, k1) = (0.25, 0.4).
  Ponte-Croft language modelling [8]  Infers a non-parametric language model for each document and ranks each document for a query by the product of the probability of producing the query terms and the probability of not producing other terms.
  Language modelling with Dirichlet priors [9]  Treats the language model as a multinomial distribution and uses the Dirichlet distribution to smooth the probability of word occurrence.
  Cover density ranking [4]  Given query terms $Q_1, \ldots, Q_n$, builds a boolean AND for all subsets (e.g. $Q_1 \wedge Q_2 \wedge Q_9 \wedge Q_n$), ranks these subsets by the sum of their terms' inverse document frequency values, and then ranks documents by the largest subset they contain.
  Divergence from randomness [1]  Uses inverse term frequency, the ratio of two Bernoulli processes, and the hypothesis that term frequency density is inversely related to document length to construct a ranking formula.

In the initial phase, each ranking method on each index was provided just the topic statement. In the second phase, Okapi-style pseudo-relevance feedback [2] was conducted using the top 20 documents from each of the systems in the initial phase. However, due to Wumpus limitations, cover density ranking (CDR) could not be used in the second phase, and Ponte-Croft ranking (PC) was dropped from the second phase due to poor performance on the 2011 topics. In addition, pseudo-relevance feedback was performed using a language model built from all tweets before the earliest query time in the 2012 topics.

We note that the queries issued to Wumpus were not the topic statements verbatim; each whitespace-delimited word was treated as its own search term. Otherwise, Wumpus would look only for tweets containing the exact topic phrase, which is generally not desirable behaviour for a search engine. Finally, only the highest-scoring 1000 tweets from each result set were used by RRF, and only the top 1000 fused results from each run were submitted to NIST for evaluation.
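To make the index variants in Table 2 concrete, here is a small, hypothetical sketch of the corresponding feature extractors; the tokenization details are our assumptions, not the actual Wumpus indexing code. It uses NLTK's Porter stemmer.

```python
from nltk.stem import PorterStemmer  # Porter stemming, as in Indexes 1-2

stemmer = PorterStemmer()

def word_features(text, stemmed=True, bigrams=True):
    """Word-level features covering Indexes 1-4 of Table 2."""
    words = text.lower().split()
    feats = list(words)                                  # unstemmed words
    if stemmed:
        feats += [stemmer.stem(w) for w in words]        # stemmed variants
    if bigrams:
        feats += [a + "_" + b for a, b in zip(words, words[1:])]
    return feats

def char_ngram_features(text, n):
    """Character n-gram features: Index 5 (n=3) and Index 6 (n=4)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tweet = "Retrieval experiments on the tweets corpus"
index1 = word_features(tweet, stemmed=True, bigrams=True)    # Index 1
index4 = word_features(tweet, stemmed=False, bigrams=False)  # Index 4
index6 = char_ngram_features(tweet, 4)                       # Index 6
```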

2.2.1 Ad Hoc Runs

In this section, the runs submitted to TREC for the ad hoc task are described.

uwatrrflm - Baseline (RRF-2LM): This run was our baseline and did not use the swapdocs approach. It is a repeat of our best-performing run in the 2011 Microblog Track, except that only Okapi-style feedback was used to select candidate documents (as opposed to the swapdocs approach) and two tweet-based language models were used (one based on the 2011 topics and one based on the 2012 topics).

uwatrrfall - Swapdocs Approach (RRF-Swapdocs): This run was conducted exactly as outlined above.

uwatgclrbase - GCLR Initial Phase (GCLR-Initial): This run contains only the highest-scoring 1000 tweets from GCLR in the first phase, i.e. trained on the 5:1 bias of topic statement to random tweets. The purpose of this run was to establish a GCLR baseline against which to judge improvements.

uwatgclr - GCLR Manual Feedback (GCLR-Manual): This run uses manual feedback with GCLR. The seed documents for GCLR relevance training came from a manual assessor who issued queries to Wumpus and selected relevant tweets from the retrieved results (all results were from before the query time). These were trained in a 1:1 ratio with randomly selected tweets (which were deemed not relevant). For topics with fewer than 20 selected tweets, the deficit was made up by including 5 copies of the topic statement for each missing tweet; there was no maximum cut-off on the number of tweets selected. We note that the assessor was instructed to avoid selecting tweets whose content duplicated previously selected tweets.

2.3 Filtering Task Methodology

For the filtering task, we used only GCLR (with byte 4-grams as features) as our means of filtering. We submitted two runs for official evaluation, outlined below.

uwn - GCLR Baseline: This run was trained with a 5:1 bias as in the ad hoc task, such that there were 1000 copies of the given relevant document, 1000 copies of the query, and 400 random tweets (trained as not relevant). After the training phase, GCLR classified all documents and no further training was performed during classification.

uw - GCLR Feedback: This run was trained identically to uwn. After the training phase, GCLR classified all documents, but any time a document was deemed relevant by the classifier, GCLR was further trained based on the document's actual judgment in the 2011 qrels: if the document was judged relevant in the qrels it was trained as relevant; if it was judged non-relevant in the qrels it was trained as non-relevant; and if it was absent from the qrels, no training was performed.
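A minimal sketch of the uw feedback loop follows, assuming a streaming classifier with score and train methods plus a qrels lookup; the class interface and names are illustrative, not the actual GCLR API.

```python
def filter_with_feedback(clf, stream, qrels, threshold=0.5):
    """Sketch of the uw run's feedback loop.

    clf: an online classifier with score(text) -> float and
         train(text, relevant=bool) methods (stand-in for GCLR).
    stream: iterable of (tweet_id, text) in time order.
    qrels: dict mapping tweet_id -> True/False relevance judgments;
           tweets absent from the qrels trigger no training.
    """
    emitted = []
    for tweet_id, text in stream:
        if clf.score(text) >= threshold:       # classifier deems it relevant
            emitted.append(tweet_id)
            if tweet_id in qrels:              # simulate user feedback via qrels
                clf.train(text, relevant=qrels[tweet_id])
    return emitted
```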
2.4 Evaluation

Tweets for the ad hoc task were judged by NIST assessors after the results from all participating teams were pooled. Assessors were asked to judge tweets as relevant, highly relevant, or not relevant with respect to the topic. In addition, only tweets that were primarily in English were allowed to be judged relevant.

The primary measure for the track this year was the ROC (receiver operating characteristic) curve; however, Ian Soboroff (a track coordinator) could not determine a means of unifying the ROC curves from different topics. Accordingly, we chose to look at Precision@30, which allows us to compare our results with our runs from last year. However, as only summary information was provided for highly relevant tweets this year, we present only the percentage of topics at and above the median for them. As last year, the average number of relevant tweets returned at rank 30 was calculated over the highest-scoring 30 tweets for each topic.

The filtering task used the same judgements as last year, as it was based on the 2011 topic set. The filtering task had four measures: set precision, set recall, F (with beta = 0.5), and linear utility. We present the number of topics for which our runs were at or above the median score for F, precision, and recall.
3 Results and Discussion

Tables 4 and 5 provide ad hoc summary results for topics with highly relevant tweets only and for topics with relevant tweets, alongside the results of the highest-scoring run we performed last year. From these summary results, our baseline run (RRF-2LM) appears to have performed at or above the median most of the time, which may indicate that it is a good baseline. Its performance is also at least as good as our best run from last year, once one factors in that only 33 topics had highly relevant tweets last year (and only 49 had relevant tweets). The swapdocs run does appear to improve on our baseline, but not by much.

To examine our baseline run (RRF-2LM) further, we looked at its performance with one language model removed: regardless of which language model was removed, there was little change in our P@30 results, although the language model based on this year's topics did perform better than our 2011 language model. This is not all that surprising, because the language model for the 2012 topics was larger and thus more likely to be an accurate representation of the language used on Twitter during the collection period. It is worth noting that removing Ponte-Croft ranking from RRF-2LM did improve performance, but not by a large enough margin that we thought it worth reporting here.

These results suggest something potentially interesting: for all the extra effort the swapdocs approach requires, there is no large increase in performance, and it is possible that simply adding GCLR trained on interim results would boost RRF-2LM enough that the swapdocs approach offers little additional benefit. In addition, anecdotal evidence suggests that RRF-2LM is much more time-efficient than RRF-Swapdocs (as it does not have to compute $N^2$ combinations). Accordingly, the gains made by the swapdocs approach may be ruled out by real-world performance concerns.

The ad hoc performance of GCLR with manually selected training documents was a pleasing result, and we would like to explore further how this approach can be modified to increase performance. We would also like to see whether results improve when the assessor is given more time to find seed documents, as the assessor for this run had a very small window.

Finally, the performance of initial-phase GCLR is unsurprising, given that the training bias was determined quickly using the 2011 topics and was not rigorously tested. Further, this approach appears limited by the language used in the topic statement: the results suggest that GCLR-Initial scored as high as it did only because of its performance on a few topics, although this would require manual verification by comparing individual topic performance against the topic statements to see whether there is some correlation.

The results for the filtering task were not particularly good, as we expected based on previous TREC Filtering Track experiments. We are at a loss to explain why the best scores for recall, precision, and F are so high.

4 Future Work

Given the performance of GCLR with manually selected training documents, we are likely to continue exploring how its performance can be increased further. It would also be desirable to find a means of incorporating manual intervention into the filtering task, perhaps by attempting to find relevant documents among the tweets that occur before the earliest tweet time in the filtering task. It may be beneficial to examine the role of different features in improving tweet classification with GCLR, and perhaps to determine whether there is an ideal bias when training classification/filtering methods for tweets. In addition, it would likely boost performance to have a pre-screening stage of filtering in which all non-English tweets are filtered out, so that there is less chance of returning non-English tweets; a sketch of such a stage follows. We had wanted to do this last year, but felt that, given various time constraints, it might be a waste of time to attempt it this year.
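As a sketch of such a pre-screening stage, one could apply an off-the-shelf language identifier to the stream before filtering; the example below uses the langid package, which is our illustrative choice rather than anything the system above employed.

```python
import langid  # off-the-shelf language identifier (pip install langid)

def english_only(tweets):
    """Drop tweets whose predicted language is not English."""
    for tweet_id, text in tweets:
        lang, _score = langid.classify(text)
        if lang == "en":
            yield tweet_id, text

# Example: only the first tweet survives pre-screening.
tweets = [("t1", "Retrieval results for the microblog track"),
          ("t2", "Les résultats de recherche pour les microblogs")]
print(list(english_only(tweets)))
```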

5 Conclusions

This year at the Microblog Track, we saw that the performance of all systems on the ad hoc task seemed much improved over the 2011 iteration. In particular, manually seeding logistic regression achieved good results, while applying swapdocs to the task may not be as beneficial as one might hope. The filtering task was largely underwhelming: we did not achieve very good results, and we are unsure how the best scores were achieved.

Table 4: Ad hoc results for highly relevant tweets only

                          2012 Runs                                           2011 Runs
                          GCLR-Initial  GCLR-Manual  RRF-Swapdocs  RRF-2LM    RRF-1LM
  Percent above median    10.17         64.41        57.63         54.24      66.67
  Percent at median       25.42         20.34        27.12         23.72      21.21
  Avg. rel. ret. at 30     3.14          8.03         6.59          6.19       3.82

Table 5: Ad hoc results for all relevant tweets

                          2012 Runs                                           2011 Runs
                          GCLR-Initial  GCLR-Manual  RRF-Swapdocs  RRF-2LM    RRF-1LM
  Avg. rel. ret. at 30    6.83          13.71        11.64         11.37      12.65

Table 6: Filtering results for F-measure, recall, and precision

         F (beta=0.5)           Recall                 Precision
  Run    Above med.  At med.    Above med.  At med.    Above med.  At med.
  uwn    7           3          7           1          26          3
  uw     3           1          3           1          36          2

References

[1] Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20:357-389, October 2002.

[2] Bodo Billerbeck and Justin Zobel. Questioning query expansion: an examination of behaviour and parameters. In Proceedings of the 15th Australasian Database Conference - Volume 27, ADC '04, pages 69-76, 2004.

[3] Charles L. Clarke, Gordon V. Cormack, Thomas R. Lynam, Chris Buckley, and Donna Harman. Swapping documents and terms. Inf. Retr., 12(6):680-694, December 2009.

[4] Charles L. A. Clarke, Gordon V. Cormack, and Elizabeth A. Tudhope. Relevance ranking for one to three term queries, 2000.

[5] Gordon V. Cormack. University of Waterloo participation in the TREC 2007 Spam Track. In Sixteenth Text REtrieval Conference (TREC 2007), Gaithersburg, MD, 2007. NIST.

[6] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 758-759, 2009.

[7] K. Sparck Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management, pages 779-840, 2000.

[8] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 275-281, 1998.

[9] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 334-342, 2001.