CISC689/489-010 Information Retrieval Midterm Exam

You have 2 hours to complete the following four questions. You may use notes and slides. You can use a calculator, but nothing that connects to the internet (no laptops, Blackberries, iPhones, etc.). Good luck!

1. Short answer (5 points each)

Answer each of the following questions in a few sentences.

(a) Why are term frequency and inverse document frequency used so often in document scoring functions?

Inverse document frequency: terms that appear in many documents are less likely to be important than terms that appear in few documents, because the more documents a term appears in, the more likely the content of those documents is to be unrelated to the term. Therefore greater inverse document frequency indicates greater importance. Term frequency: a term that appears many times in a document is an indicator that the term is important to that document, and therefore that the document is more likely to be about that term. Therefore greater term frequency indicates greater importance.

(b) How do stopping and stemming reduce the size of an inverted index?

Stopping: by eliminating the terms with very long inverted lists. Stemming: by reducing the number of inverted lists, since the lists for two or more terms with the same stem are consolidated into one.

(c) With 5,000 documents and 10,000 unique vocabulary terms, a bit vector index requires 5 × 10^7 bits of storage. Suppose documents have 200 terms on average. If we added 2,200 more documents to the collection, roughly how big would the bit vector index become? Use Heaps' law with k = 10 and β = 0.5.

Heaps' law tells us that 10,000 = 10 × (5000 × 200)^0.5. If we add 2,200 documents, we have V = 10 × (7200 × 200)^0.5 = 12,000 vocabulary terms, so the bit vector index requires 7200 × 12,000 ≈ 9 × 10^7 bits.

(d) The figure below depicts interpolated precision-recall curves for two search engines that index research articles. There is no difference between the engines except in how they score documents.
Imagine you're a scientist looking for all published work on some topic. You don't want to miss any citation. Which engine would you prefer and why?

[Figure: interpolated precision (y-axis, 0.0 to 1.0) vs. recall (x-axis, 0.0 to 1.0) curves for Engine 1 and Engine 2.]
Engine 1 ranks more relevant documents in the top results, but Engine 2 finds all of the relevant documents faster. If there are R relevant documents total, and precision at recall = 1 is roughly 0.25 for Engine 2, then I have to go to rank R/0.25 = 4R in Engine 2 to find all of the relevant documents. If precision at recall = 1 is roughly 0.01 for Engine 1, then I have to go to rank R/0.01 = 100R in Engine 1 to find all of the relevant documents. That's 25 times more documents I have to look at in Engine 1's results. Thus I prefer Engine 2.

(e) Describe the advantages and disadvantages of language models and vector space models with respect to each other.

Vector space models are flexible: you can set term weights to be anything you want, and you can easily add additional features. They are easy to understand and implement. The main disadvantage is that there is no formally-motivated way to determine what the term weights should be, and there is a practically infinite space of possibilities to search in. Language models have a formal motivation in terms of probabilities of terms being sampled from documents, which limits the search space of term weights. Language models can incorporate arbitrarily complex features of natural language. The disadvantages are that there are many parameters to estimate (probabilities of features in every document), it is difficult to incorporate non-language features, and there is no explicit model of relevance.

(f) You have developed a new method for parsing documents that uses semantic information to decide which sentences to index and which to skip. How would you determine whether your method produces better retrieval results than indexing every sentence?

I assume I already have an index that includes every sentence and a query processing engine for that index. Now I will re-parse every document in my collection and re-build the index from scratch.
I will use a sample of queries as inputs to both engines, giving me two document rankings for each query: one from the original index and one from the index built with the new parsing method. Then I will have assessors judge the relevance of documents, and use those relevance judgments to calculate measures like precision, recall, and average precision. Whichever engine has the highest precision or recall or average precision is the one that was better.

(g) What is the primary difference between a signature file index and a bit vector index? How does this difference affect performance (storage space and retrieval performance)?

In a bit vector index, every document is represented by a bit vector of length V (the vocabulary size). In a signature file index, documents are represented by bit vectors of length k (a parameter set by the engine developer). In terms of storage space, k can be set so that the index has much lower storage requirements than the bit vector index. In terms of retrieval performance, the smaller k is, the more collisions there will be in query processing. This results in more false matches.

(h) Suppose you have observed that users click on the second result half as often as the first result, on the third result 1/3rd as often as the first result, and so on (clicks on the ith result = 1/i times clicks on the first result). How would you modify the discount in DCG to model this behavior?

I would set the discount function to 1/i instead of 1/log2(i + 1).
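The modified discount from part (h) can be dropped straight into a DCG computation. Below is a minimal sketch; the gain values and function names are illustrative, not from the exam:

```python
import math

def dcg(gains, discount):
    """Discounted cumulative gain of a ranked list of gain values,
    using an arbitrary per-rank discount function."""
    return sum(g * discount(i) for i, g in enumerate(gains, start=1))

# Standard DCG discount: 1 / log2(i + 1).
log_discount = lambda i: 1.0 / math.log2(i + 1)

# Click-based discount from part (h): clicks on result i are 1/i
# times the clicks on result 1, so discount rank i by 1/i.
click_discount = lambda i: 1.0 / i

gains = [3, 2, 3, 0, 1]  # hypothetical relevance grades for a ranking
print(dcg(gains, log_discount))
print(dcg(gains, click_discount))
```

Note that 1/i decays faster than 1/log2(i + 1), so the click-based discount penalizes relevant documents at low ranks more heavily.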
2. Indexing (20 points)

Sketch pseudo-code for indexing a collection of documents with an inverted file. Be sure to include all the steps you have performed in the project. It does not have to be exactly correct, but it should cover all the major points of building an inverted index. You do not need to include pseudo-code for stemming and compression, but you should include calls to stemming and compression functions where appropriate. (Be sure to manage your time spent on this problem. Do not spend an hour making sure every detail works right. Focus on including the steps in the right order.)

Here's one possibility. Obviously there are many possible answers, though the basic steps do not change.

function ParseAndTokenize(D)
    determine which parts of D are important (according to pre-determined rules)
    tokenize those parts into a list of tokens T (according to pre-determined rules)
    return T

function Index(C)
    I = new InvertedIndex
    for each document D in the collection C
        T = ParseAndTokenize(D)
        for each term t in T
            if t is in an unimportant part of D, skip it
            else if t is a stop word, skip it
            else
                w = stem(t)
                if (!I.hasTerm(w))
                    I[w] = new InvertedList
                end if
                updatelist(I[w], D)
            end if
        end for
    end for
    I.write

function updatelist(l, D)
    // l is a class that keeps track of:
    //   most recently added document (l.lastdoc)
    //   term frequency in most recently added document (l.tf)
    //   collection term frequency (l.ctf)
    //   document frequency (l.df)
    //   compressed inverted list (l.list)
    if l.lastdoc == D
        l.tf++
        l.ctf++
    else
        l.list.push(compress(l.tf))
        dgap = D - l.lastdoc
        l.list.push(compress(dgap))
        l.tf = 1
        l.ctf++
        l.lastdoc = D
        l.df++
    end if
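The pseudo-code above can be made concrete in a few lines of Python. This is a sketch, not the project's actual code: the tokenizer, stop list, and suffix-stripping "stemmer" are stand-ins for real components, and postings here store uncompressed (doc ID, term frequency) pairs:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}  # stand-in stop list

def stem(token):
    # Stand-in stemmer; a real system would use Porter or Krovetz.
    return token[:-1] if token.endswith("s") else token

def tokenize(text):
    return text.lower().split()

def build_index(collection):
    """collection: dict mapping doc ID (int) -> document text.
    Returns term -> list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id in sorted(collection):       # process docs in ID order
        counts = defaultdict(int)
        for token in tokenize(collection[doc_id]):
            if token in STOP_WORDS:         # stopping
                continue
            counts[stem(token)] += 1        # stemming, then count tf
        for term, tf in counts.items():
            index[term].append((doc_id, tf))  # doc IDs arrive sorted
    return index

docs = {1: "the cats sat", 2: "a cat and a dog", 3: "dogs sat"}
index = build_index(docs)
print(index["cat"])   # → [(1, 1), (2, 1)]
```

Because documents are processed in increasing ID order, each postings list comes out sorted by document ID, which is what makes d-gap compression (as in updatelist above) possible.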
3. Retrieval (20 points)

Our discussion of inverted lists in class has generally assumed that they store document IDs, term frequencies, and a few other things (collection term frequencies, document frequencies, positions, for example). In general, inverted lists do not need to store those features, and can actually store much more complex data about documents and terms. This data can aid rapid query processing.

Let us assume that we have decided to use language model document scoring with Jelinek-Mercer smoothing, with parameter λ fixed at λ = 0.2. Recall the Jelinek-Mercer scoring function:

    score(D_i, Q) = log P(Q | D_i) = sum_{t in Q} log P(t | D_i)
                  = sum_{t in Q} log( (1 - λ) tf_{t,D_i} / |D_i| + λ ctf_t / |C| )

where tf_{t,D_i} is the number of times t appears in D_i, |D_i| is the length of D_i, ctf_t is the number of times t appears in all documents, and |C| is the total number of term occurrences in the collection. Assume we will never need to calculate BIM or BM25 or any other scoring function. Further assume that we will use term-at-a-time processing for queries. This question is about fast, efficient document scoring using inverted lists.

(a) It is possible to calculate the Jelinek-Mercer document score during query processing using only additions: no division, no multiplication, and no logarithms would be needed. What information about documents and terms (not including tf and ctf) would you need to store in an inverted list to be able to do so? Describe what the uncompressed list would look like and what data types would be needed to store the information in it.

During indexing, we can calculate P(t | D_i) for every term in every document. We may then store inverted lists that look like this:

    t_j -> ( log(0.2 ctf_{t_j} / |C|),
             (D_1, log(0.8 tf_{t_j,D_1} / |D_1| + 0.2 ctf_{t_j} / |C|)),
             (D_2, log(0.8 tf_{t_j,D_2} / |D_2| + 0.2 ctf_{t_j} / |C|)),
             ... )

Instead of document frequency, we store the background log-probability log P(t_j | C) = log(0.2 ctf_{t_j} / |C|) (needed for documents that do not contain t_j), and instead of term frequencies we store log P(t_j | D_i) directly.
Then we can score documents just by adding up the pre-computed log term probabilities for the query terms. Storing this would require floating point numbers rather than integers.

(b) Given part a, what are three ways you could further improve efficiency of term-at-a-time query processing? Explain in detail how each one improves speed.

i. Sort each inverted list in decreasing order of P(t_j | D_i). I can do this during indexing. Then during term-at-a-time processing, I will always be focusing on the highest-scoring documents. I can stop processing a list after the top k highest-scoring documents for more efficient processing.

ii. Sort all the inverted lists for the query terms in increasing order of length. This ensures that I will process the shortest lists first. The shortest lists are the ones with the lowest document frequency, and therefore the ones that contribute the most to document scores. Now I can do very simple score thresholding: if P(t_j | D_i) is less than the kth lowest score so far, I can skip the rest of the inverted list.

iii. Skip lists, caching, and other answers are acceptable.

(c) Now suppose we don't want to fix λ; we want to have the freedom to change λ without recreating the inverted lists. Is the statement in part a still true? Why or why not?

No. To be able to use only additions, we had to pre-compute P(t | D_i) using a particular value of λ. If we want to be able to change λ, we cannot pre-compute and store P(t | D_i).
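A sketch of how the precomputed lists from part (a) support addition-only, term-at-a-time scoring, with λ = 0.2 as in the question. All function names and the toy statistics are invented for illustration; at query time the only arithmetic is addition:

```python
import math

LAMBDA = 0.2

def precompute_lists(tf, doc_len, ctf, collection_len):
    """Build lists holding the background log-probability and the
    precomputed log P(t|D) for each posting, as in part (a).
    tf: term -> {doc_id: term frequency in that doc}."""
    lists = {}
    for term, postings in tf.items():
        background = math.log(LAMBDA * ctf[term] / collection_len)
        entries = [(d, math.log((1 - LAMBDA) * f / doc_len[d]
                                + LAMBDA * ctf[term] / collection_len))
                   for d, f in postings.items()]
        lists[term] = (background, entries)
    return lists

def score_term_at_a_time(query, lists, all_docs):
    # Documents containing the term get its precomputed log P(t|D);
    # all other documents get the term's background log-probability.
    scores = {d: 0.0 for d in all_docs}
    for term in query:
        background, entries = lists[term]
        seen = dict(entries)
        for d in all_docs:
            scores[d] += seen.get(d, background)  # addition only
    return scores

# Toy statistics (invented): "cat" appears twice in doc 1, once in doc 2.
tf = {"cat": {1: 2, 2: 1}}
doc_len = {1: 10, 2: 8, 3: 8}
ctf = {"cat": 3}
lists = precompute_lists(tf, doc_len, ctf, collection_len=26)
scores = score_term_at_a_time(["cat"], lists, all_docs=[1, 2, 3])
```

All divisions, multiplications, and logarithms happen once, at index time, inside precompute_lists.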
(d) Would any of the compression algorithms we discussed (short byte, restricted-length variable byte, general variable byte) work without modification to effectively compress your lists in part a? Why or why not?

No, because we had to store floating point numbers. All of those compression methods were for integers.

(e) (Extra credit) If your answer to part d is no, can you come up with an alternative compression algorithm (you may describe it in general terms)?

There are many possibilities here. One is to use the definition of the float or double data types to modify v-byte coding appropriately. Another is to map floats to ints using some pre-determined scale where high probabilities get high integer values and low probabilities get low integer values. This might be a little lossy, but it would also speed up query processing even more.
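The second idea in part (e), mapping floats onto a fixed integer scale, might look like the sketch below. The range [-20, 0] and the 256 quantization levels are arbitrary choices, and the scheme is lossy by design; the resulting small integers could then be fed to an integer coder like v-byte:

```python
import math

def quantize(logp, lo=-20.0, hi=0.0, levels=256):
    """Map a log-probability in [lo, hi] to an integer in [0, levels-1].
    Higher probabilities get higher codes, as described in part (e)."""
    clamped = max(lo, min(hi, logp))
    return round((clamped - lo) / (hi - lo) * (levels - 1))

def dequantize(code, lo=-20.0, hi=0.0, levels=256):
    """Recover an approximate log-probability from its integer code."""
    return lo + code / (levels - 1) * (hi - lo)

x = math.log(0.15)      # some precomputed log P(t|D)
code = quantize(x)      # fits in one unsigned byte
approx = dequantize(code)
print(code, approx)
```

With 256 levels over a 20-unit range, the worst-case reconstruction error is half a step, about 0.04 in log space.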
4. Evaluation (20 points)

The probability ranking principle is one of the fundamental tenets of information retrieval. In this problem we will (partially) prove it.

(a) State the probability ranking principle in your own words. Why is it important?

The probability ranking principle says that the optimal ranking of documents is in decreasing order of probability of relevance. It is important because it gives a guideline for optimizing retrieval engines: the better we can estimate the probability of relevance of documents, the better our retrieval engine will be.

(b) For a given rank k, let R_k be the number of relevant documents that appear from ranks 1 to k. If our engine gives us the probability that document D_i is relevant to query Q, i.e. P(R | D_i, Q), the expected value of R_k is defined as follows:

    E[R_k] = sum_{i=1}^{k} P(R | D_i, Q)    (1)

Show, or informally prove by contradiction, that for every value of k, E[R_k] is maximized by ranking documents in decreasing order of probability.

The proof is by contradiction. Suppose we have a ranking of documents that is not in decreasing order of probability. That means there are documents D_k, D_j such that D_k is ranked above D_j but P(R | D_k, Q) < P(R | D_j, Q) (i.e. D_k is less likely to be relevant than D_j). Then

    E[R_k] = sum_{i=1}^{k} P(R | D_i, Q) = P(R | D_1, Q) + P(R | D_2, Q) + ... + P(R | D_k, Q)
           < P(R | D_1, Q) + P(R | D_2, Q) + ... + P(R | D_j, Q).

The expectation is less than it would have been if we had put D_j at rank k instead of D_k, and therefore it is not maximized. Therefore if documents are not ranked in decreasing order of probability, there is some rank k for which E[R_k] is not maximized.

(c) Use Eq. 1 to define expressions for the expected value of precision and the expected value of recall. You may assume that there are R relevant documents total.
    E[precision at k] = E[R_k] / k
    E[recall at k] = E[R_k] / R

(d) Use part b to show that your expressions from part c are maximized by ranking documents in decreasing order of probability.

It follows directly from part b. If E[R_k] is not maximized, then the expectations of precision and recall cannot be maximized either, since E[R_k] is in the numerator of both expressions.

(e) (Extra credit) Part d gives an optimal way to rank documents assuming that we have a way to estimate relevance probabilities P(R | D_i, Q). In class we talked about models that do that: BIM and BM25 are two examples. Suppose that instead of using term statistics, like BIM and BM25 do, we use user clicks to estimate probabilities, so that the documents with the most clicks get the highest probabilities of relevance. Documents could then be ranked in decreasing order of clicks. Have we created the perfect search engine? Why or why not?

No, because people tend to click on documents just because they are highly ranked (see part h of problem 1). If the system wasn't perfect in the first place, then we will just be reinforcing its imperfections.
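The claim in part (b) can also be checked by brute force on a toy example: for a set of hypothetical probability estimates (invented for this sketch), the decreasing-probability ordering dominates every other permutation at every cutoff k:

```python
from itertools import permutations

probs = [0.9, 0.7, 0.4, 0.2]  # hypothetical P(R | D_i, Q) estimates

def expected_rk(ranking, k):
    # Eq. (1): expected number of relevant documents in the top k.
    return sum(ranking[:k])

best = sorted(probs, reverse=True)
for ranking in permutations(probs):
    for k in range(1, len(probs) + 1):
        # The sorted order's prefix sum is never beaten at any cutoff.
        assert expected_rk(best, k) >= expected_rk(ranking, k)
print("decreasing-probability order maximizes E[R_k] at every k")
```

Since E[precision at k] and E[recall at k] just divide E[R_k] by the constants k and R, the same ordering maximizes both, as part (d) argues.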