UMass at TREC 2006: Enterprise Track


Desislava Petkova and W. Bruce Croft
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts, Amherst, MA 01003

Abstract

This paper gives an overview of the work done at the University of Massachusetts, Amherst for the TREC 2006 Enterprise track. For the discussion search task, we compare two methods for incorporating thread evidence into the language models of email messages. For the expert finding task, we create implicit expert representations as mixtures of language models from associated documents.

1 Introduction

In this paper we discuss two retrieval tasks in an enterprise setting: finding email messages which contain pro or con arguments on a given topic, and finding people who are knowledgeable in a particular area. Both tasks offer unique challenges that are particular to enterprise search: diversity of structured and unstructured documents, variety of languages and formats, existence of social networks, and security requirements. In our work we consider some of these issues, in particular document structure and heterogeneity.

2 Data processing

For our experiments, we used the Indri search engine in the Lemur toolkit [5]. Its powerful query language allows formulating richly structured queries and incorporating various sources of contextual evidence. We preprocessed the www and lists subcollections of the W3C corpus, removing HTML tags from web documents and stripping quoted text and signature lines from email documents. We used Jangada [1] to extract signature blocks and reply-to lines from emails. Finally, we used the all-in-reply-to list compiled by William Webber to group emails by thread.

3 Discussion Search

Emails form a considerable part of the communication in an organization and are characterized by rich internal and external structure. Previous work has shown that email structure is a useful source of information in known-item finding [4]. We exploit internal structure by weighting header and mainbody text differently, and external structure by adding a third component corresponding to thread text. In the header we combine evidence from the subject, date, to, from and cc fields. The mainbody is the original text of a message with reply-to, forward and signature lines removed. The thread is the concatenated text of messages in the tree-like structure of an email conversation.

One approach for incorporating thread context is to estimate a language model for the thread and interpolate it with the smoothed language models of the other email components. This corresponds to the following retrieval model:

P(t | D) = \lambda_1 P(t | D_{header}) + \lambda_2 P(t | D_{mainbody}) + \lambda_3 P(t | D_{thread})

where \lambda_1 + \lambda_2 + \lambda_3 = 1 and

P(t | D_{component}) = P_{MLE}(t | D_{component}) + \alpha P(t | Corpus),

with \alpha being a general symbol for smoothing.
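To make the component mixture concrete, here is a minimal Python sketch that scores an email against a query by interpolating maximum likelihood models of the header, mainbody and thread, each smoothed with a collection model. It is an illustrative reimplementation rather than the Indri configuration used for the official runs: Jelinek-Mercer interpolation stands in for the generic alpha smoothing, and the mixture weights are placeholder values chosen to sum to one.

import math
from collections import Counter

def mle(tokens):
    # Maximum likelihood unigram distribution over a token list.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def smoothed(term, component, collection, alpha=0.4):
    # P(t | component), smoothed with the collection model.
    # Jelinek-Mercer interpolation stands in for the paper's generic alpha smoothing.
    return (1 - alpha) * component.get(term, 0.0) + alpha * collection.get(term, 1e-9)

def score_email(query, header, mainbody, thread, collection, lambdas=(0.2, 0.5, 0.3)):
    # Query log-likelihood under the mixture of header, mainbody and thread models.
    # The lambdas are illustrative placeholders that sum to 1, as the model requires.
    parts = [mle(header), mle(mainbody), mle(thread)]
    score = 0.0
    for t in query:
        p = sum(lam * smoothed(t, part, collection) for lam, part in zip(lambdas, parts))
        score += math.log(p + 1e-12)
    return score

# Toy usage: score one email for the query "schema validation".
collection = mle("the of and schema validation xml web email thread discussion".split())
print(score_email(["schema", "validation"],
                  header="re schema validation".split(),
                  mainbody="i think the schema needs stricter validation".split(),
                  thread="schema validation discussion for the xml working group".split(),
                  collection=collection))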


An alternative way to take advantage of thread information is to use it as a background model for smoothing maximum likelihood estimates from the header and mainbody. The hypothesis is that threads will provide a more reasonable fallback distribution than a word distribution for general English:

P(t | D) = \lambda_1 P(t | D_{header}) + \lambda_2 P(t | D_{body})

where

P(t | D_{component}) = P_{MLE}(t | D_{component}) + \alpha P(t | D_{thread})

P(t | Thread) = P_{MLE}(t | Thread) + \beta P(t | Corpus)

3.1 Runs

We submitted four official runs for evaluation. For all runs we use the Term Dependency Model [3] and the Relevance Model [2] for pseudo-relevance feedback to expand the query. Term dependency increases precision by adding proximity features, and pseudo-relevance feedback increases recall by adding terms related to the original query. We use the following mixing weights for the component models: λ_header = 0.3, λ_mainbody = 0.7, λ_thread = 0.3.

1. Baseline (unofficial): Uses the entire email content (header fields and mainbody) without breaking it up into components.
2. UMaTiMixHdr: Emails are divided into header and mainbody. The components are smoothed independently with background models of all header text and all mainbody text.
3. UMaTiSmoThr: The header and mainbody components are both smoothed with the thread which a particular message is part of. (Messages that are not part of a thread are smoothed with the collection, as in UMaTiMixHdr.)
4. UMaTiMixThr: Emails have three components (header, mainbody and thread), smoothed independently with background models of header, mainbody and thread text.
5. UMaTDMixThr: Similar to UMaTiMixThr but also using the description field of the query topic.

Run ID        map    R-prec  RR1    P@5    P@10
Baseline      .2905  .3367   .6834  .5240  .4700
UMaTiMixHdr   .3058  .3567   .6673  .5520  .5240
UMaTiSmoThr   .3197  .3672   .7081  .5800  .5340
UMaTiMixThr   .3373  .3715   .6538  .5440  .5300
UMaTDMixThr   .3631  .3963   .7134  .5880  .5820

Table 1: Discussion search: submitted runs.

Results show that smoothing with a thread-based fallback model is more effective than smoothing with a general collection model. However, constructing a mixture of language models from header, mainbody and thread text is even more effective. Corroborating previous research, internal and external email structure is a useful source of information.

4 Expert Search

We propose a formal method for constructing hierarchical expert representations, based on statistical relevance modeling. Our main goal is that this process takes advantage of various information sources and prior knowledge about the experts, the collection or the domain.

4.1 Model

Our expert modeling approach includes the following steps (a sketch of steps 3 and 4 follows the list):

1. Define what counts as a reference to an expert E, so that occurrences of E can be detected.

2. Rank and retrieve profile documents. We use language modeling with Dirichlet smoothing.

3. For each document D in the profile set S_E, compute the posterior P(D | E), assuming that the prior distribution is uniform:

   P(D | E) = P(E | D) P(D) / P(E)

4. Form a term distribution for E by incorporating the document model P(t | D), taken to be the maximum likelihood estimate, and marginalizing:

   P(t | E) = \sum_{D \in S_E} P(t | D) P(D | E)
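The sketch below illustrates steps 3 and 4 under simplifying assumptions: the name matching and profile retrieval are taken as given, and a simple interpolated estimate of P(E | D) stands in for the Dirichlet-smoothed estimate used in the paper. The function names are illustrative only.

from collections import Counter, defaultdict

def mle(tokens):
    # Maximum likelihood unigram distribution over a token list.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def name_likelihood(name_terms, doc_model, collection, gamma=0.2):
    # P(E | D): likelihood of the expert's name terms under the document model,
    # with simple interpolation smoothing in place of Dirichlet smoothing.
    p = 1.0
    for t in name_terms:
        p *= (1 - gamma) * doc_model.get(t, 0.0) + gamma * collection.get(t, 1e-9)
    return p

def expert_model(name_terms, profile_docs, collection):
    # profile_docs: {doc_id: token list} for the profile set S_E.
    doc_models = {d: mle(toks) for d, toks in profile_docs.items()}
    # Step 3: posterior P(D | E) under a uniform document prior,
    # i.e. the name likelihoods normalized over the profile set.
    likelihoods = {d: name_likelihood(name_terms, m, collection) for d, m in doc_models.items()}
    z = sum(likelihoods.values()) or 1.0
    posterior = {d: l / z for d, l in likelihoods.items()}
    # Step 4: marginalize, P(t | E) = sum_D P(t | D) P(D | E).
    p_t_e = defaultdict(float)
    for d, m in doc_models.items():
        for t, p in m.items():
            p_t_e[t] += p * posterior[d]
    return dict(p_t_e)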

To summarize, we represent an expert as a mixture of documents, where the mixing weights are specified by the posterior distribution P(D | E). Once we have built models for all candidates, we find experts relevant to a particular topic Q by ranking the candidates according to query likelihood.

The result of step 4 is a probability distribution over words describing the context of an expert's name, where P(t | E) is estimated using a particular name definition and a homogeneous collection of documents. Representations estimated from different collections or alternative name definitions can be interpolated to build hierarchical expert models:

P(t | E) = \sum_{c \in C} \lambda_c P(t | E_c),   with \sum_{c \in C} \lambda_c = 1

4.2 Experiments

With our experiments we want to determine whether hierarchical expert models can effectively create rich representations from heterogeneous data and answer complex queries.

Heterogeneous sources

The W3C corpus is composed of several subcollections, each comprising documents of a particular type. We independently build a language model from one subcollection at a time and then represent an expert as a mixture of those models. This allows us to treat each subcollection differently according to its specific intrinsic properties, e.g. when smoothing to estimate P(E | D), as well as to weight the information sources. We use the lists subcollection (average length 450 words after preprocessing) and the www subcollection (average length 2000 words). We automatically set the Dirichlet smoothing parameter μ to the average document length, and we experimentally determine an optimal value for the mixing parameter λ_www = 0.6. Results are reported in Table 2. Although models built from www outperform models built from lists, by combining the two we achieve even better performance, indicating that email discussion lists contain information not contained in the web pages.

Term dependency

An important feature of our model is that it preserves information inherent in documents, such as structure and term positions. Consequently, it can capture higher-level language features such as relationships between terms. Note that because experts are modeled indirectly as a set of documents, it is possible that query terms appear in the profile set but do not co-occur in any document within that set. We implement term dependency as described by Metzler and Croft [3], using both sequential dependency and full dependency between query terms to include restrictions on terms appearing in close proximity in the text. Results (Table 2) show consistent improvement in mean average precision.

Combining expert definitions

Finally, we compare two rules to define a person entity: LAST matches the last name of a candidate, and FULL matches the first and last name separated by at most two words. LAST is a loose definition, since some people have the same family name. FULL is more strict but misses some true associations, since people are not necessarily referred to by their full names, especially in emails. This is an example of the tradeoff between recall and precision. The profile set from a loose definition is larger but more ambiguous, as some associations are incorrect. On the other hand, the profile set from a strict definition is smaller but more precise: retrieved documents are reliably associated with the person, but at the same time valid documents are overlooked.
Combining the two expert definitions, LAST and FULL, gives slightly better performance than either alternative separately (Table 2).
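As a concrete illustration of the hierarchical combination summarized in Table 2 below, the following sketch interpolates the per-definition, per-subcollection term distributions using the mixing weights reported with the table and ranks candidates by query likelihood. The two-level nesting is one plausible reading of the hierarchy, and the helper functions are hypothetical, building on the expert_model sketch above.

import math

def interpolate(models, weights):
    # P(t | E) = sum_c lambda_c P(t | E_c), with the lambdas summing to 1.
    combined = {}
    for model, w in zip(models, weights):
        for t, p in model.items():
            combined[t] = combined.get(t, 0.0) + w * p
    return combined

def hierarchical_expert_lm(models):
    # models: dict keyed by (definition, source) -> term distribution P(t | E_c).
    # First combine sources within each name definition (lambda_www = 0.6, lambda_lists = 0.4),
    # then combine the two definitions (lambda_FULL = 0.9, lambda_LAST = 0.1).
    last = interpolate([models[("LAST", "lists")], models[("LAST", "www")]], [0.4, 0.6])
    full = interpolate([models[("FULL", "lists")], models[("FULL", "www")]], [0.4, 0.6])
    return interpolate([full, last], [0.9, 0.1])

def query_likelihood(query_terms, expert_lm, collection, beta=0.3):
    # Candidates are ranked by the likelihood of the topic under their expert model;
    # light smoothing with the collection model avoids zero probabilities.
    score = 0.0
    for t in query_terms:
        p = (1 - beta) * expert_lm.get(t, 0.0) + beta * collection.get(t, 1e-9)
        score += math.log(p + 1e-12)
    return score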

Entity definition    Documents      Term dependency
LAST   FULL          lists   www    No      Yes
 X                    X             .1841   .2008
 X                            X     .2387   .2922
        X             X             .2679   .3029
        X                     X     .3127   .3732
 X                    X       X     .2689   .3220
        X             X       X     .3460   .4139
 X      X             X       X     .3514   .4216

Table 2: map is incrementally improved by combining several representations to form a hierarchy. Mixing parameters: λ_FULL = 0.9, λ_LAST = 0.1, λ_www = 0.6, λ_lists = 0.4.

4.3 Runs

We submitted four official runs which use the hierarchical representation described above. We expand the title-only query with the description and narrative fields, and with pseudo-relevance feedback terms.

1. UMaTiDm: Title / Term dependency
2. UMaTNDm: Title+Description / Term dependency
3. UMaTNFb: Title+Description / Term dependency / Pseudo-relevance feedback
4. UMaTDFb: Title+Description+Narrative / Term dependency / Pseudo-relevance feedback

Run ID     map    R-prec  RR1    P@5    P@10
UMaTiDm    .4216  .4379   .7478  .6449  .5469
UMaTNDm    .4307  .4573   .7875  .6408  .5469
UMaTNFb    .4706  .4972   .8207  .6980  .5816
UMaTDFb    .5016  .5108   .8571  .7265  .6388

Table 3: Expert finding: submitted runs.

4.4 Conclusion

We described a language modeling approach for finding people who are experts on a given topic. It is based on collecting evidence for expertise from multiple sources in a heterogeneous collection, using language modeling to find associations between documents and experts and to estimate the degree of association, and finally integrating models to construct rich and effective expert representations. Our approach provides a broad framework for answering questions about experts. It is based on the following two assumptions:

1. We can come up with useful definition(s) of a named entity in the form of a query. This allows us to use any retrieval technique to find associated documents.

2. We can assume that the co-occurrence of terms and named entities is evidence of topical connection. Note that this means we do not distinguish between positive and negative document support.

On the other hand, since the model does not have any specific knowledge about what it means to be an expert, it is very general and can be applied to building representations for other named entities such as places, events and organizations.

5 Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010. Any opinions, findings and conclusions expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

[1] V. R. Carvalho and W. W. Cohen. Learning to extract signature and reply lines from email. In CEAS 2004.
[2] V. Lavrenko and W. B. Croft. Relevance based language models. In SIGIR 2001.
[3] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In SIGIR 2005.
[4] P. Ogilvie and J. Callan. Experiments with language models for known-item finding of e-mail messages. In TREC 2005.
[5] T. Strohman, D. Metzler, H. Turtle, and W. B. Croft. Indri: A language model-based search engine for complex queries. Technical report, University of Massachusetts, Amherst, 2005.