CISC689/489-010 Information Retrieval Midterm Exam


You have 2 hours to complete the following four questions. You may use notes and slides. You can use a calculator, but nothing that connects to the internet (no laptops, Blackberries, iPhones, etc.). Good luck!

1. Short answer (5 points each)

Answer each of the following questions in a few sentences.

(a) Why are term frequency and inverse document frequency used so often in document scoring functions?

Inverse document frequency: terms that appear in many documents are less likely to be important than terms that appear in few documents, because the more documents a term appears in, the more likely the content of those documents is to be unrelated to the term. Therefore greater inverse document frequency indicates greater importance. Term frequency: a term that appears many times in a document is an indicator that the term is important to that document, and therefore that the document is more likely to be about that term. Therefore greater term frequency indicates greater importance.

(b) How do stopping and stemming reduce the size of an inverted index?

Stopping: by eliminating the terms with very long inverted lists. Stemming: by reducing the number of inverted lists, consolidating the lists for two or more terms that share the same stem.

(c) With 5,000 documents and 10,000 unique vocabulary terms, a bit vector index requires 5 × 10^7 bits of storage. Suppose documents have 200 terms on average. If we added 2,200 more documents to the collection, roughly how big would the bit vector index become? Use Heaps' law with k = 10 and β = 0.5.

Heaps' law tells us that 10,000 = 10 · (5000 · 200)^0.5. If we add 2,200 documents, we have V = 10 · (7200 · 200)^0.5 = 12,000 vocabulary terms, so the bit vector index requires 7200 · 12,000 ≈ 9 × 10^7 bits.
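As a quick check of that arithmetic, here is a small Python sketch of the Heaps' law estimate; the constants k = 10 and β = 0.5 come from the question, and nothing else is assumed.

def heaps_vocabulary(total_term_occurrences, k=10, beta=0.5):
    """Heaps' law estimate of vocabulary size: V = k * n^beta."""
    return k * total_term_occurrences ** beta

docs, added, avg_len = 5000, 2200, 200
v_old = heaps_vocabulary(docs * avg_len)             # 10 * (1,000,000)^0.5 = 10,000 terms
v_new = heaps_vocabulary((docs + added) * avg_len)   # 10 * (1,440,000)^0.5 = 12,000 terms
print(v_old, docs * v_old)                           # 10000.0  50,000,000 bits
print(v_new, (docs + added) * v_new)                 # 12000.0  86,400,000 bits, roughly 9e7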

(d) The figure below depicts interpolated precision-recall curves for two search engines that index research articles. There is no difference between the engines except in how they score documents. Imagine you're a scientist looking for all published work on some topic. You don't want to miss any citation. Which engine would you prefer and why?

[Figure: interpolated precision (0.0 to 1.0) plotted against recall (0.0 to 1.0) for Engine 1 and Engine 2.]

Engine 1 ranks more relevant documents in the top results, but Engine 2 finds all of the relevant documents faster. If there are R relevant documents total, and precision at recall = 1 is roughly 0.25 for Engine 2, then I have to go to rank R/0.25 = 4R in Engine 2 to find all of the relevant documents. If precision at recall = 1 is roughly 0.01 for Engine 1, then I have to go to rank R/0.01 = 100R in Engine 1 to find all of the relevant documents. That is 25 times more documents I have to look at in Engine 1's results. Thus I prefer Engine 2.

(e) Describe the advantages and disadvantages of language models and vector space models with respect to each other.

Vector space models are flexible: you can set term weights to be anything you want, and you can easily add additional features. They are easy to understand and implement. The main disadvantage is that there is no formally motivated way to determine what the term weights should be, and there is a practically infinite space of possibilities to search in. Language models have a formal motivation in terms of probabilities of terms being sampled from documents, which limits the search space of term weights. Language models can incorporate arbitrarily complex features of natural language. The disadvantages are that there are many parameters to estimate (probabilities of features in every document), it is difficult to incorporate non-language features, and there is no explicit model of relevance.

(f) You have developed a new method for parsing documents that uses semantic information to decide which sentences to index and which to skip. How would you determine whether your method produces better retrieval results than indexing every sentence?

I assume I already have an index that includes every sentence and a query processing engine for that index. Now I will re-parse every document in my collection and re-build the index from scratch. I will use a sample of queries as inputs to both engines, giving me two document rankings for each query: one from the original index, one from the index built with the new parsing method. Then I will have assessors judge the relevance of documents, and use those relevance judgments to calculate measures like precision, recall, and average precision. Whichever engine has the higher precision or recall or average precision is the one that was better.

(g) What is the primary difference between a signature file index and a bit vector index? How does this difference affect performance (storage space and retrieval performance)?

In a bit vector index, every document is represented by a bit vector of length V (the vocabulary size). In a signature file index, documents are represented by bit vectors of length k (a parameter set by the engine developer). In terms of storage space, k can be set so that the index has much lower storage requirements than the bit vector index. In terms of retrieval performance, the smaller k is, the more collisions there will be in query processing. This results in more false matches.

(h) Suppose you have observed that users click on the second result half as often as the first result, on the third result 1/3rd as often as the first result, and so on (clicks on the ith result = 1/i times clicks on the first result). How would you modify the discount in DCG to model this behavior?

I would set the discount function to 1/i instead of 1/log(i + 1).
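For concreteness, a small Python sketch of DCG with a pluggable discount, comparing the usual base-2 log discount with the click-based 1/i discount from the answer; the relevance grades in the example are made up.

import math

def dcg(gains, discount):
    """Discounted cumulative gain with a pluggable per-rank discount function."""
    return sum(gain * discount(rank) for rank, gain in enumerate(gains, start=1))

standard_discount = lambda i: 1.0 / math.log2(i + 1)   # the usual log-based discount
click_discount = lambda i: 1.0 / i                     # matches the observed 1/i click pattern

gains = [3, 2, 0, 1]   # made-up graded relevance of the top four results
print(dcg(gains, standard_discount), dcg(gains, click_discount))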

2. Indexing (20 points)

Sketch pseudo-code for indexing a collection of documents with an inverted file. Be sure to include all the steps you have performed in the project. It does not have to be exactly correct, but it should cover all the major points of building an inverted index. You do not need to include pseudo-code for stemming and compression, but you should include calls to stemming and compression functions where appropriate. (Be sure to manage your time spent on this problem. Do not spend an hour making sure every detail works right. Focus on including the steps in the right order.)

Here is one possibility. Obviously there are many possible answers, though the basic steps do not change.

function ParseAndTokenize(D)
    determine which parts of D are important (according to pre-determined rules)
    tokenize those parts into a list of tokens T (according to pre-determined rules)
    return T

function Index(C)
    I = new InvertedIndex
    for each document D in the collection C
        T = ParseAndTokenize(D)
        for each term t in T
            if t is in an unimportant part of D, skip it
            else if t is a stop word, skip it
            else
                w = stem(t)
                if (!I.hasTerm(w))
                    I[w] = new InvertedList
                end if
                updatelist(I[w], D)
            end if
        end for
    end for
    I.write()

function updatelist(l, D)
    // l is a class that keeps track of:
    //   the most recently added document (l.lastdoc)
    //   the term frequency in the most recently added document (l.tf)
    //   the collection term frequency (l.ctf)
    //   the document frequency (l.df)
    //   the compressed inverted list (l.list)
    if l.lastdoc == D
        l.tf++
        l.ctf++
    else
        l.list.push(compress(l.tf))
        dgap = D - l.lastdoc
        l.list.push(compress(dgap))
        l.tf = 1
        l.ctf++
        l.lastdoc = D
        l.df++
    end if
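For reference, a minimal runnable version of those steps in Python; this is a sketch, not the project's actual code. The regex tokenizer, the tiny stop word list, and the suffix-stripping stem() are illustrative stand-ins, postings store d-gaps and term frequencies without compression, and term counts are aggregated per document rather than streamed token by token as in the pseudo-code.

import re
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}   # illustrative stop list

def parse_and_tokenize(text):
    """Lowercase and split on non-letters: a stand-in for real parsing rules."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stem(token):
    """Toy stemmer that strips a trailing 's'; a real system would call Porter or Krovetz."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(collection):
    """collection: dict of doc_id (int) -> document text.
    Returns term -> {df, ctf, lastdoc, postings}, where postings holds (d-gap, tf) pairs."""
    index = defaultdict(lambda: {"df": 0, "ctf": 0, "lastdoc": 0, "postings": []})
    for doc_id in sorted(collection):
        counts = defaultdict(int)
        for token in parse_and_tokenize(collection[doc_id]):
            if token in STOP_WORDS:
                continue                       # stopping
            counts[stem(token)] += 1           # stemming
        for term, tf in counts.items():
            entry = index[term]
            entry["postings"].append((doc_id - entry["lastdoc"], tf))   # d-gap, uncompressed
            entry["lastdoc"] = doc_id
            entry["df"] += 1
            entry["ctf"] += tf
    return dict(index)

docs = {1: "cats chase mice", 2: "the mice hide from the cats", 3: "dogs chase cats"}
print(build_index(docs)["cat"])
# {'df': 3, 'ctf': 3, 'lastdoc': 3, 'postings': [(1, 1), (1, 1), (1, 1)]}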

3. Retrieval (20 points)

Our discussion of inverted lists in class has generally assumed that they store document IDs, term frequencies, and a few other things (collection term frequencies, document frequencies, and positions, for example). In general, inverted lists do not need to store those features, and can actually store much more complex data about documents and terms. This data can aid rapid query processing.

Let us assume that we have decided to use language model document scoring with Jelinek-Mercer smoothing, with the parameter \lambda fixed at \lambda = 0.2. Recall the Jelinek-Mercer scoring function:

score(D_i, Q) = \log P(Q \mid D_i) = \sum_{t \in Q} \log P(t \mid D_i) = \sum_{t \in Q} \log\left( (1-\lambda)\,\frac{tf_{t,D_i}}{|D_i|} + \lambda\,\frac{ctf_t}{|C|} \right)

where tf_{t,D_i} is the number of times t appears in D_i, |D_i| is the length of D_i, ctf_t is the number of times t appears in all documents, and |C| is the total number of term occurrences in the collection. Assume we will never need to calculate BIM or BM25 or any other scoring function. Further assume that we will use term-at-a-time processing for queries. This question is about fast, efficient document scoring using inverted lists.

(a) It is possible to calculate the Jelinek-Mercer document score during query processing using only additions: no division, no multiplication, and no logarithms would be needed. What information about documents and terms (not including tf and ctf) would you need to store in an inverted list to be able to do so? Describe what the uncompressed list would look like and what data types would be needed to store the information in it.

During indexing, we can calculate P(t \mid D_i) for every term in every document. We may then store inverted lists that look like this:

t_j \rightarrow \left( 0.2\,\frac{ctf_t}{|C|},\; \left(D_1, \log\left(0.8\,\frac{tf_{t,D_1}}{|D_1|} + 0.2\,\frac{ctf_t}{|C|}\right)\right),\; \left(D_2, \log\left(0.8\,\frac{tf_{t,D_2}}{|D_2|} + 0.2\,\frac{ctf_t}{|C|}\right)\right),\; \ldots \right)

Instead of document frequency, we store P(t_j \mid C) = 0.2 \cdot ctf_t / |C|, and instead of term frequencies we store the precomputed \log P(t_j \mid D_i) directly. Then we can score documents just by adding up the pre-computed log-probabilities for the query terms. Storing this would require floating point numbers rather than integers.

(b) Given part (a), what are three ways you could further improve the efficiency of term-at-a-time query processing? Explain in detail how each one improves speed.

i. Sort each inverted list in decreasing order of P(t_j \mid D_i). I can do this during indexing. Then during term-at-a-time processing, I will always be focusing on the highest-scoring documents. I can stop processing a list after the top k highest-scoring documents for more efficient processing.

ii. Sort all the inverted lists for the query terms in increasing order of length. This ensures that I will process the shortest lists first. The shortest lists are the ones with the lowest document frequency, and therefore the ones that contribute the most to document scores. Now I can do very simple score thresholding: if P(t_j \mid D_i) is less than the kth lowest score so far, I can skip the rest of the inverted list.

iii. Skip lists, caching, and other answers are acceptable.

(c) Now suppose we don't want to fix \lambda; we want to have the freedom to change \lambda without re-creating the inverted lists. Is the statement in part (a) still true? Why or why not?

No. To be able to use only additions, we had to pre-compute P(t \mid D_i) using a particular value of \lambda. If we want to be able to change \lambda, we cannot pre-compute and store P(t \mid D_i).
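A minimal Python sketch of this scheme, with one extra assumption not stated in the answer: the background component 0.2 · ctf_t/|C| is also stored as a log, so that query-time work really is addition only. The toy statistics at the bottom are made up.

import math
from collections import defaultdict

LAMBDA = 0.2   # the fixed Jelinek-Mercer parameter from the question

def precompute_lists(tf, doclen, ctf, total_terms):
    """Build, for each term, a stored background score and (doc_id, log P(t|D)) entries.
    tf: term -> {doc_id: term frequency}; doclen: doc_id -> |D|;
    ctf: term -> collection term frequency; total_terms: |C|."""
    lists = {}
    for term, postings in tf.items():
        background = LAMBDA * ctf[term] / total_terms
        entries = [(d, math.log((1 - LAMBDA) * f / doclen[d] + background))
                   for d, f in sorted(postings.items())]
        lists[term] = (math.log(background), entries)
    return lists

def score_term_at_a_time(query_terms, lists, doclen):
    """Query-time scoring with additions only: add the stored log-probabilities, and add the
    stored background score for documents that do not contain the term."""
    scores = defaultdict(float)
    for term in query_terms:
        if term not in lists:
            continue
        log_background, entries = lists[term]
        matched = set()
        for doc_id, logp in entries:
            scores[doc_id] += logp
            matched.add(doc_id)
        for doc_id in doclen:
            if doc_id not in matched:
                scores[doc_id] += log_background
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy statistics: two documents, |C| = 8 term occurrences in the whole collection.
doclen = {1: 3, 2: 5}
tf = {"cat": {1: 2}, "mouse": {1: 1, 2: 3}}
ctf = {"cat": 2, "mouse": 4}
lists = precompute_lists(tf, doclen, ctf, total_terms=8)
print(score_term_at_a_time(["cat", "mouse"], lists, doclen))   # document 1 scores highest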

(d) Would any of the compression algorithms we discussed (short byte, restricted-length variable byte, general variable byte) work without modification to effectively compress your lists in part (a)? Why or why not?

No, because we had to store floating point numbers. All of those compression methods were designed for integers.

(e) (Extra credit) If your answer to part (d) is no, can you come up with an alternative compression algorithm (you may describe it in general terms)?

There are many possibilities here. One is to use the definition of the float or double data types to modify v-byte coding appropriately. Another is to map floats to ints using some pre-determined scale, where high probabilities get high integer values and low probabilities get low integer values. This might be a little lossy, but it would also speed up query processing even more.
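One possible instance of the "map floats to a pre-determined integer scale" idea, sketched in Python. The SCALE constant and the v-byte convention used here (seven payload bits per byte, high bit set on the final byte) are illustrative assumptions, and the rounding makes it slightly lossy, as the answer notes.

SCALE = 10_000   # hypothetical precision: keep about four decimal places of the log-probability

def quantize(log_prob):
    """Map a (negative) log P(t|D) to a non-negative integer on a fixed scale (lossy)."""
    return int(round(-log_prob * SCALE))

def vbyte_encode(n):
    """Variable-byte code: 7 payload bits per byte, high bit set on the final byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vbyte_decode(data):
    """Decode a concatenation of v-byte codes back into a list of integers."""
    n, values = 0, []
    for b in data:
        if b < 128:
            n = n * 128 + b
        else:
            values.append(n * 128 + (b - 128))
            n = 0
    return values

# A log-probability of about -1.0035 becomes the integer 10035, which fits in 2 bytes instead
# of the 4 or 8 a raw float would take; decoding and dividing by SCALE recovers the score.
code = vbyte_encode(quantize(-1.0035))
print(len(code), [-q / SCALE for q in vbyte_decode(code)])   # 2 [-1.0035]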

4. Evaluation (20 points)

The probability ranking principle is one of the fundamental tenets of information retrieval. In this problem we will (partially) prove it.

(a) State the probability ranking principle in your own words. Why is it important?

The probability ranking principle says that the optimal ranking of documents is in decreasing order of probability of relevance. It is important because it gives a guideline for optimizing retrieval engines: the better we can estimate the probability of relevance of documents, the better our retrieval engine will be.

(b) For a given rank k, let R_k be the number of relevant documents that appear at ranks 1 to k. If our engine gives us the probability that document D_i is relevant to query Q, i.e. P(R \mid D_i, Q), the expected value of R_k is defined as follows:

E[R_k] = \sum_{i=1}^{k} P(R \mid D_i, Q)    (1)

Show, or informally prove by contradiction, that for every value of k, E[R_k] is maximized by ranking documents in decreasing order of probability.

The proof is by contradiction. Suppose we have a ranking of documents that is not in decreasing order of probability. That means there are documents D_k, D_j such that D_k is ranked above D_j (at rank k) but P(R \mid D_k, Q) < P(R \mid D_j, Q), i.e. D_k is less likely to be relevant than D_j. Then

E[R_k] = \sum_{i=1}^{k} P(R \mid D_i, Q) = P(R \mid D_1, Q) + P(R \mid D_2, Q) + \cdots + P(R \mid D_k, Q) < P(R \mid D_1, Q) + P(R \mid D_2, Q) + \cdots + P(R \mid D_j, Q).

The expectation is less than it would have been if we had put D_j at rank k instead of D_k, and therefore it is not maximized. Therefore, if documents are not ranked in decreasing order of probability, there is some rank k for which E[R_k] is not maximized.

(c) Use Eq. 1 to define expressions for the expected value of precision and the expected value of recall. You may assume that there are R relevant documents total.

E[precision at k] = E[R_k] / k
E[recall at k] = E[R_k] / R

(d) Use part (b) to show that your expressions from part (c) are maximized by ranking documents in decreasing order of probability.

It follows directly from part (b). If E[R_k] is not maximized, then the expectations of precision and recall cannot be maximized either, since E[R_k] is in the numerator of both expressions.

(e) (Extra credit) Part (d) gives an optimal way to rank documents assuming that we have a way to estimate relevance probabilities P(R \mid D_i, Q). In class we talked about models that do that: BIM and BM25 are two examples. Suppose that instead of using term statistics, like BIM and BM25 do, we use user clicks to estimate probabilities, so that the documents with the most clicks get the highest probabilities of relevance. Documents could then be ranked in decreasing order of clicks. Have we created the perfect search engine? Why or why not?

No, because people tend to click on documents just because they are highly ranked (see part (h) of problem 1). If the system wasn't perfect in the first place, then we will just be reinforcing its imperfections.
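As a small numerical illustration of Eq. 1 and parts (b) through (d), a Python sketch that computes the expected metrics for a ranking; the relevance probabilities and the value R = 3 are made up.

def expected_metrics(probs, R):
    """Expected relevant-found, precision, and recall at each rank, per Eq. 1."""
    expected_rk, metrics = 0.0, []
    for k, p in enumerate(probs, start=1):
        expected_rk += p
        metrics.append((k, expected_rk, expected_rk / k, expected_rk / R))
    return metrics

unsorted_run = [0.2, 0.9, 0.4, 0.7, 0.1]          # some arbitrary ranking
prp_run = sorted(unsorted_run, reverse=True)      # ranking by the probability ranking principle

for k, e_rk, e_prec, e_rec in expected_metrics(prp_run, R=3):
    print(f"k={k}: E[R_k]={e_rk:.2f}  E[prec@k]={e_prec:.2f}  E[rec@k]={e_rec:.2f}")
# At every k, the sorted run's E[R_k] is at least that of the unsorted run, as part (b) argues.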