Manning Chapter: Text Retrieval (Selections) Text Retrieval Tasks. Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniques

Size: px

Start display at page:

Download "Manning Chapter: Text Retrieval (Selections) Text Retrieval Tasks. Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniques"

Ezra Cannon
5 years ago
Views:

1 Text Retrieval Readings Introduction Manning Chapter: Text Retrieval (Selections) Text Retrieval Tasks Vorhees & Harman (Bulkpack) Evaluation The Vector Space Model Advanced Techniues 1 2 Text Retrieval: Overview Text Retrieval Tasks Goal: Use a collection of documents to generate a response to a uery. Prototypical example: Web search engine Ad-hoc retrieval Given: Large fixed document set, Goal: Find documents that answer ueries. Filtering Given: Fixed uery Goal: Decide whether documents answer that uery. Question-Answering Given: Large fixed document set Goal: Find short passages (~ 1 sentence) that answer uestions. 3 4

2 Text Retrieval Tasks: Classification There are many different kinds of text retrieval. Variables for Classification: Is the uery known ahead of time? Is the corpus known ahead of time? What kind of uery is used? What kind of corpus is used? What kind of response should be generated? Text Retrieval Tasks: Classification (Cont'd) Example classifications: Task Known ahead of time? Corpus Query Response Ad Hoc Corpus Short Medium Document Question Answering Corpus Short Sentence Filtering Query Long Document Cross Language Corpus Multilingual Short Document Speech Corpus Speech Short Medium Document 5 6 TREC Evaluation Text REtrieval Conference Annual workshop to foster reasearch in text retrieval 50+ groups compete for high performance. 16 Tracks = types of text retrieval 7 tracks were active in TREC-9 Ad Hoc Database Merging High Precision Routing Filtering Very Large Corpus Interactive Chinese Query Spanish NLP Cross Language Confusion Speech Web Precision/Recall: Review Target Target Selected True positive False positive Selected False negative True negative Precision = What proportion of selected items are correct? Recall = tp tp fn tp tp fp What proportion of target items are selected? 7 8

3 Precision/Recall Graphs Summary Measures Text retrieval generates multiple responses Consider the first n responses for various n Uninterpolated average precision P = positions at which we got a true positive n=0: precision=100%; recall=0% n=n: precision=0%; recall=100% Graph precision vs. recall at various cutoffs Precision 100% 50% average precision = Avg(precision at cutoff p) F Measure Weighted harmonic mean of precision and recall. 1 α 1 1α 1 P R p P 0% 0% 20% 40% 60% 80% 100% Recall 9 10 Test Set Construction Pooling Which documents are relevant? Need human evaluation New evaluation for each uery Need a complete set of relevant documents Can't find recall without a complete set TREC: Construct a test set that is "approximately correct" Assume that all releveant documents are returned by at least one text retrieval system Manually evaluate the top 100 documents returned by each system. But there are too many documents to evaluate! No bias against any competing system. Small bias against systems that did not compete

4 The Vector Space Model Represent each document as a sparse vector x Each dimension corresponds to a single vocabulary item: 1 if the vocabulary item is in the document 0 if the vocabulary item is not in the document Comparing Documents Two documents are similar if they have similar term vectors. d = document multisetof words V = w 1, w 2,...,w n = vocabulary w i = word x = =,..., x n 1:if w d i 0:if w i Comparing Documents & Queries Comparing Term Vectors Treat the uery as a very small document. Construct a vector representation of the uery. What makes term vectors similar? Attempt 1: Term vectors are similar if their difference is small. similarity 1 = x 1 Similarity(,) Query vectors are typically shorter than document vectors. 15 Does document length matter? We only care about the relative freuency of each term. 16

5 Comparing Term Vectors Cont'd: The Cosine Measure What makes term vectors similar? Attempt 2: Term vectors are similar if the angle between them is small. similarity 2 = cos cos = x 1 x 1 Similarity(,) Normalizing Term Vectors Cosines are easier to compute if we first normalize all document vectors. x = x x = x n 2 i=1 Similarity(,) similarity 2 = x Term Weighting Some terms are more informative than others. Can we get a better simularity measure if we use different vectors? x = Change our definition of =,..., x n 1:if w d i 0:if w i Term Weighting (Continued) Term Freuency Term freuency: Higher term freuency term is more relevant to the document. Weight a term proportionally to its term freuency? = tf i tf i = numberof occurancesof w i d But a word appearing 3 times as often is not 3 times as relevant: use smoothing. Increase for more informative terms. = 1 log tf i :ifw i d 0:ifw i or = tf i 19 20

6 Term Weighting (Continued) Inverse Document Freuency Latent Semantic Indexing Document Freuency: df i = numberof documentscontainingw i Vector model assumes that term contributions to document similarity are independant. Higher document freuency term is less informative. Many terms are not independant. "star chart" is similar to "astral map" Weight a term inversely to its document freuency. Project document vectors into a new vector space. = 1 log tf i log N df i :if w i d In the new vector space, terms that are semantically similar are close to each other. 0:if w i "Latent" semantic dimensions (use smoothing) Using NLP Techniues Using NLP Techniues: Words Traditional wisdom: linguistic knowledge does not improve text retrieval performance. Add new terms: Find terms with collocation analysis Especially true for ad-hoc retrieval. Synonym expansion: WordNet But: Make terms more precise: Linguistic knowledge useful for other tasks (e.g., uestion answering, cross-language text retrieval) Clearly a temporary fact: it is impossible to get 100% performance without linguistic knowledge. Word sense disambiguation Merge terms with the same meaning: Stemming Clean up a "messy" corpus: Language modelling 23 Spell-check 24

7 Using NLP Techniues: Syntax Find new terms: Entity extraction Relationship extraction Narrow the search space: Question parsing Determine what type of answer we expect Entity tagging Tag entities with types (e.g., company, person, date) Relationship extraction Search for partial relationships (e.g., "John likes?x") 25

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and