CISC689/ Information Retrieval Midterm Exam


You have 2 hours to complete the following four questions. You may use notes and slides. You can use a calculator, but nothing that connects to the internet (no laptops, Blackberries, iPhones, etc.). Good luck!

1. Short answer (5 points each)

Answer each of the following questions in a few sentences.

(a) Why are term frequency and inverse document frequency used so often in document scoring functions?

Inverse document frequency: terms that appear in many documents are less likely to be important than terms that appear in few documents, because the more documents a term appears in, the more likely the content of those documents is to be unrelated to the term. Therefore greater inverse document frequency indicates greater importance.

Term frequency: a term that appears many times in a document is an indicator that the term is important to that document, and therefore that the document is more likely to be about that term. Therefore greater term frequency indicates greater importance.

(b) How do stopping and stemming reduce the size of an inverted index?

Stopping: by eliminating the terms with very long inverted lists.

Stemming: by reducing the number of inverted lists, consolidating the lists for two or more terms with the same stem.

(c) With 5,000 documents and 10,000 unique vocabulary terms, a bit vector index requires 5,000 × 10,000 = 50,000,000 bits of storage. Suppose documents have 200 terms on average. If we added 2,200 more documents to the collection, roughly how big would the bit vector index become? Use Heaps' law with k = 10 and β = 0.5.

Heaps' law tells us that $10{,}000 = 10 \cdot (5{,}000 \cdot 200)^{0.5}$. If we add 2,200 documents, the collection has 7,200 documents and we have $V = 10 \cdot (7{,}200 \cdot 200)^{0.5} = 12{,}000$ vocabulary terms, so the bit vector index requires $7{,}200 \cdot 12{,}000 = 86{,}400{,}000$ bits.

(d) The figure below depicts interpolated precision-recall curves for two search engines that index research articles. There is no difference between the engines except in how they score documents. Imagine you're a scientist looking for all published work on some topic. You don't want to miss any citation. Which engine would you prefer and why?

[Figure: interpolated precision (y-axis) versus recall (x-axis) for Engine 1 and Engine 2.]

Engine 1 ranks more relevant documents in the top results, but Engine 2 finds all of the relevant documents faster. If there are R relevant documents total, and precision at recall = 1 is roughly 0.25 for Engine 2, then I have to go to rank R/0.25 = 4R in Engine 2 to find all of the relevant documents. If precision at recall = 1 is roughly 0.01 for Engine 1, then I have to go to rank R/0.01 = 100R in Engine 1 to find all of the relevant documents. That's 25 times as many documents to look at in Engine 1's results. Thus I prefer Engine 2.

(e) Describe the advantages and disadvantages of language models and vector space models with respect to each other.

Vector space models are flexible: you can set term weights to be anything you want and you can easily add additional features. They are easy to understand and implement. The main disadvantage is that there is no formally motivated way to determine what the term weights should be, and there is a practically infinite space of possibilities to search in.

Language models have a formal motivation in terms of probabilities of terms being sampled from documents, which limits the search space of term weights. Language models can incorporate arbitrarily complex features of natural language. The disadvantages are that there are many parameters to estimate (probabilities of features in every document), it is difficult to incorporate non-language features, and there is no explicit model of relevance.

(f) You have developed a new method for parsing documents that uses semantic information to decide which sentences to index and which to skip. How would you determine whether your method produces better retrieval results than indexing every sentence?

I assume I already have an index that includes every sentence and a query processing engine for that index. Now I will re-parse every document in my collection and re-build the index from scratch. I will use a sample of queries as inputs to both engines, giving me two document rankings for each query: one from the original index, one from the index with the new parsing method. Then I will have assessors judge the relevance of documents, and use those relevance judgments to calculate measures like precision, recall, and average precision. Whichever engine has the highest precision, recall, or average precision is the better one.

(g) What is the primary difference between a signature file index and a bit vector index? How does this difference affect performance (storage space and retrieval performance)?

In a bit vector index, every document is represented by a bit vector of length V (the vocabulary size). In a signature file index, documents are represented by bit vectors of length k (a parameter set by the engine developer). In terms of storage space, k can be set so that the index has much lower storage requirements than the bit vector index. In terms of retrieval performance, the smaller k is, the more collisions there will be in query processing. This results in more false matches.

(h) Suppose you have observed that users click on the second result half as often as the first result, on the third result 1/3 as often as the first result, and so on (clicks on the ith result = 1/i times clicks on the first result). How would you modify the discount in DCG to model this behavior?

I would set the discount function to 1/i instead of 1/log(i + 1).
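The change proposed in part (h) can be illustrated with a small sketch. This is not part of the exam solution: the function names and the example gains are hypothetical, and the base-2 logarithm is an assumption, since the answer writes log(i + 1) without a base.

```python
import math

def dcg(gains, discount):
    """Discounted cumulative gain: sum of gain at rank i times discount(i), ranks start at 1."""
    return sum(g * discount(i) for i, g in enumerate(gains, start=1))

def log_discount(i):
    # Usual DCG discount, 1 / log(i + 1); base 2 is assumed here.
    return 1.0 / math.log2(i + 1)

def click_discount(i):
    # Discount from part (h): rank i is clicked 1/i as often as rank 1.
    return 1.0 / i

gains = [3, 2, 0, 1]  # hypothetical graded relevance of the top four results
print(dcg(gains, log_discount))
print(dcg(gains, click_discount))
```

The 1/i discount falls off faster than the logarithmic one, so relevant documents at lower ranks contribute less to the total.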

2. Indexing (20 points)

Sketch pseudo-code for indexing a collection of documents with an inverted file. Be sure to include all the steps you have performed in the project. It does not have to be exactly correct, but it should cover all the major points of building an inverted index. You do not need to include pseudo-code for stemming and compression, but you should include calls to stemming and compression functions where appropriate. (Be sure to manage your time spent on this problem. Do not spend an hour making sure every detail works right. Focus on including the steps in the right order.)

Here's one possibility. Obviously there are many possible answers, though the basic steps do not change.

function ParseAndTokenize(D)
    determine which parts of D are important (according to pre-determined rules)
    tokenize those parts into a list of tokens T (according to pre-determined rules)
    return T

function Index(C)
    I = new InvertedIndex
    for each document D in the collection C
        T = ParseAndTokenize(D)
        for each term t in T
            if t is in an unimportant part of D, skip it
            else if t is a stop word, skip it
            else
                w = stem(t)
                if (!I.hasTerm(w))
                    I[w] = new InvertedList
                end if
                updatelist(I[w], D)
            end if
        end for
    end for
    I.write

function updatelist(l, D)
    // l is a class that keeps track of:
    //   most recently added document (l.lastdoc)
    //   term frequency in most recently added document (l.tf)
    //   collection term frequency (l.ctf)
    //   document frequency (l.df)
    //   compressed inverted list (l.list)
    if l.lastdoc == D
        l.tf++
        l.ctf++
    else
        // flush the count for the previous document, then start the new one
        // (the tf of the final document must be flushed the same way before I.write)
        l.list.push(compress(l.tf))
        dgap = D - l.lastdoc
        l.list.push(compress(dgap))
        l.tf = 1
        l.ctf++
        l.lastdoc = D
        l.df++
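For readers who want something runnable, the same steps can be sketched in Python. This is a minimal illustration under stated assumptions rather than the solution key's method: the stop list, the toy `stem` function, and the variable-byte `vbyte` encoder are hypothetical stand-ins, documents are plain strings numbered from 1, and term frequencies are counted per document before being appended instead of being updated token by token. Each posting is written as a d-gap followed by the term frequency.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # hypothetical stop list

def parse_and_tokenize(text):
    """Lowercase and split on whitespace (stand-in for real parsing rules)."""
    return text.lower().split()

def stem(token):
    """Toy stemmer (an assumption): strip a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def vbyte(n):
    """Variable-byte encode a non-negative int, low 7 bits first.
    The high bit is set on every byte except the last."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def build_index(docs):
    """docs: list of strings; document IDs are 1, 2, 3, ... in list order."""
    postings = defaultdict(list)  # term -> [(doc_id, tf), ...] in doc_id order
    for doc_id, text in enumerate(docs, start=1):
        counts = {}
        for t in parse_and_tokenize(text):
            if t in STOP_WORDS:
                continue  # stopping
            w = stem(t)   # stemming
            counts[w] = counts.get(w, 0) + 1
        for w, tf in counts.items():
            postings[w].append((doc_id, tf))

    index = {}
    for w, plist in postings.items():
        data = bytearray()
        last = 0
        for doc_id, tf in plist:
            data += vbyte(doc_id - last)  # d-gap
            data += vbyte(tf)             # term frequency
            last = doc_id
        index[w] = {
            "df": len(plist),                   # document frequency
            "ctf": sum(tf for _, tf in plist),  # collection term frequency
            "list": bytes(data),                # compressed inverted list
        }
    return index

index = build_index(["the cats chase dogs", "dogs and more dogs", "a cat sleeps"])
print(index["dog"])
```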

3. Retrieval (20 points)

Our discussion of inverted lists in class has generally assumed that they store document IDs, term frequencies, and a few other things (collection term frequencies, document frequencies, positions, for example). In general, inverted lists do not need to store those features, and can actually store much more complex data about documents and terms. This data can aid rapid query processing.

Let us assume that we have decided to use language model document scoring with Jelinek-Mercer smoothing, with parameter λ fixed at λ = 0.2. Recall the Jelinek-Mercer scoring function:

$$score(D_i, Q) = \log P(Q \mid D_i) = \sum_{t \in Q} \log P(t \mid D_i) = \sum_{t \in Q} \log\left( (1-\lambda)\,\frac{tf_{t,D_i}}{|D_i|} + \lambda\,\frac{ctf_t}{|C|} \right)$$

where $tf_{t,D_i}$ is the number of times t appears in $D_i$, $|D_i|$ is the length of $D_i$, $ctf_t$ is the number of times t appears in all documents, and $|C|$ is the total number of term occurrences in the collection.

Assume we will never need to calculate BIM or BM25 or any other scoring function. Further assume that we will use term-at-a-time processing for queries. This question is about fast, efficient document scoring using inverted lists.

(a) It is possible to calculate the Jelinek-Mercer document score during query processing using only additions: no division, no multiplication, and no logarithms would be needed. What information about documents and terms (not including tf and ctf) would you need to store in an inverted list to be able to do so? Describe what the uncompressed list would look like and what data types would be needed to store the information in it.

During indexing, we can calculate $P(t \mid D_i)$ for every term in every document. We may then store inverted lists that look like this:

$$t_j \rightarrow \left( 0.2\,\frac{ctf_{t_j}}{|C|},\ \left( D_1,\ \log\left( 0.8\,\frac{tf_{t_j,D_1}}{|D_1|} + 0.2\,\frac{ctf_{t_j}}{|C|} \right) \right),\ \left( D_2,\ \log\left( 0.8\,\frac{tf_{t_j,D_2}}{|D_2|} + 0.2\,\frac{ctf_{t_j}}{|C|} \right) \right),\ \ldots \right)$$

Instead of document frequency, we store the smoothed collection probability $0.2\,ctf_{t_j}/|C|$, and instead of term frequencies we store $\log P(t_j \mid D_i)$ directly. Then we can score documents just by adding up the pre-computed log probabilities for the query terms. Storing this would require floating point numbers rather than integers.

(b) Given part a, what are three ways you could further improve efficiency of term-at-a-time query processing? Explain in detail how each one improves speed.

i. Sort each inverted list in decreasing order of $P(t_j \mid D_i)$. I can do this during indexing. Then during term-at-a-time processing, I will always be focusing on the highest-scoring documents. I can stop processing a list after its top k highest-scoring documents for more efficient processing.

ii. Sort all the inverted lists for the query terms in increasing order of length. This ensures that I will process the shortest lists first. The shortest lists are the ones with the lowest document frequency, and therefore the ones that contribute the most to document scores. Now I can do very simple score thresholding: if $P(t_j \mid D_i)$ is less than the kth lowest score so far, I can skip the rest of the inverted list.

iii. Skip lists, caching, other answers acceptable.

(c) Now suppose we don't want to fix λ; we want to have the freedom to change λ without recreating the inverted lists. Is the statement in part a still true? Why or why not?

No. To be able to use only additions, we had to pre-compute $P(t \mid D_i)$ using a particular value of λ. If we want to be able to change λ, we cannot pre-compute and store $P(t \mid D_i)$.
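The additions-only idea from part (a) can be sketched as follows. This is an illustrative sketch, not the solution key's implementation: `build_lists` and `score` are hypothetical names, documents are given as token lists, and the per-term header is stored as a logarithm (an assumption on top of the answer above) so that documents absent from a list can also be scored by addition alone.

```python
import math
from collections import Counter

LAMBDA = 0.2  # fixed at indexing time, as in the question

def build_lists(docs):
    """docs: {doc_id: [tokens]}.
    Returns term -> (log of smoothed background, {doc_id: log P(t|D)})."""
    coll_len = sum(len(toks) for toks in docs.values())      # |C|
    ctf = Counter(t for toks in docs.values() for t in toks)
    lists = {t: (math.log(LAMBDA * c / coll_len), {}) for t, c in ctf.items()}
    for d, toks in docs.items():
        for t, tf in Counter(toks).items():
            background = LAMBDA * ctf[t] / coll_len
            # All multiplication, division, and logs happen here, at indexing time.
            lists[t][1][d] = math.log((1 - LAMBDA) * tf / len(toks) + background)
    return lists

def score(query, lists, doc_ids):
    """Term-at-a-time scoring: one accumulator per document, additions only."""
    acc = {d: 0.0 for d in doc_ids}
    for t in query:
        if t not in lists:
            continue  # term unseen in the collection; ignored in this sketch
        log_background, postings = lists[t]
        for d in acc:
            acc[d] += postings.get(d, log_background)
    return acc

docs = {1: ["a", "b", "a"], 2: ["b", "c"]}
lists = build_lists(docs)
print(score(["a", "c"], lists, docs.keys()))
```

Changing `LAMBDA` requires rebuilding every list, which is the point made in part (c).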

(d) Would any of the compression algorithms we discussed (short byte, restricted-length variable byte, general variable byte) work without modification to effectively compress your lists in part a? Why or why not?

No, because we had to store floating point numbers. All of those compression methods were for integers.

(e) (Extra credit) If your answer to part d is no, can you come up with an alternative compression algorithm (you may describe it in general terms)?

There are many possibilities here. One is to use the definition of float or double data types to modify v-byte coding appropriately. Another is to map floats to ints using some pre-determined scale where high probabilities get high integer values and low probabilities get low integer values. This might be a little lossy, but it would also speed up query processing even more.
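The second idea in part (e), mapping floats to integers on a fixed scale and then applying an integer code, might look like the sketch below. The scale, the floor value, and all function names are assumptions chosen for illustration; the variable-byte coder writes the low 7 bits first and sets the high bit on every byte except the last.

```python
import math

SCALE = 1000    # hypothetical resolution: 1/1000 of a log unit
FLOOR = -20.0   # hypothetical smallest log probability we bother to represent

def quantize(log_prob):
    """Map a log probability in [FLOOR, 0] to a non-negative int.
    Higher probabilities get higher integers; the mapping is lossy."""
    clipped = max(FLOOR, min(0.0, log_prob))
    return round((clipped - FLOOR) * SCALE)

def dequantize(q):
    """Approximate inverse of quantize."""
    return q / SCALE + FLOOR

def vbyte_encode(n):
    """Variable-byte encode one non-negative int."""
    out = bytearray()
    while n >= 128:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def vbyte_decode(data):
    """Decode a byte string of concatenated variable-byte integers."""
    values, n, shift = [], 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # more bytes follow for this integer
        else:
            values.append(n)    # last byte of this integer
            n, shift = 0, 0
    return values

log_p = math.log(0.08)
encoded = vbyte_encode(quantize(log_p))
print(len(encoded), dequantize(vbyte_decode(encoded)[0]))
```

Because the quantized values are integers on a common scale, they can be added directly during query processing and only the final ranking order matters.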

4. Evaluation (20 points)

The probability ranking principle is one of the fundamental tenets of information retrieval. In this problem we will (partially) prove it.

(a) State the probability ranking principle in your own words. Why is it important?

The probability ranking principle says that the optimal ranking of documents is in decreasing order of probability of relevance. It is important because it gives a guideline for optimizing retrieval engines: the better we can estimate the probability of relevance of documents, the better our retrieval engine will be.

(b) For a given rank k, let $R_k$ be the number of relevant documents that appear from ranks 1 to k. If our engine gives us the probability that document $D_i$ is relevant to query Q, i.e. $P(R \mid D_i, Q)$, the expected value of $R_k$ is defined as follows:

$$E[R_k] = \sum_{i=1}^{k} P(R \mid D_i, Q) \qquad (1)$$

Show, or informally prove by contradiction, that for every value of k, $E[R_k]$ is maximized by ranking documents in decreasing order of probability.

The proof is by contradiction. Suppose we have a ranking of documents that is not in decreasing order of probability. That means there are documents $D_k$, $D_j$ such that $D_k$ is ranked above $D_j$ but $P(R \mid D_k, Q) < P(R \mid D_j, Q)$ (i.e. $D_k$ is less likely to be relevant than $D_j$). Taking $D_k$ to be the document at rank k and $D_j$ a document ranked below it,

$$E[R_k] = \sum_{i=1}^{k} P(R \mid D_i, Q) = P(R \mid D_1, Q) + P(R \mid D_2, Q) + \cdots + P(R \mid D_k, Q) < P(R \mid D_1, Q) + P(R \mid D_2, Q) + \cdots + P(R \mid D_j, Q).$$

The expectation is less than it would have been if we had put $D_j$ at rank k instead of $D_k$, and therefore it is not maximized. Therefore if documents are not ranked in decreasing order of probability, there is some rank k for which $E[R_k]$ is not maximized.

(c) Use Eq. 1 to define expressions for the expected value of precision and the expected value of recall. You may assume that there are R relevant documents total.

$$E[\text{precision at } k] = \frac{E[R_k]}{k} \qquad E[\text{recall at } k] = \frac{E[R_k]}{R}$$

(d) Use part b to show that your expressions from part c are maximized by ranking documents in decreasing order of probability.

It follows directly from part b. If $E[R_k]$ is not maximized, then the expectations of precision and recall cannot be maximized either, since $E[R_k]$ is in the numerator of both expressions.

(e) (Extra credit) Part d gives an optimal way to rank documents assuming that we have a way to estimate relevance probabilities $P(R \mid D_i, Q)$. In class we talked about models that do that: BIM and BM25 are two examples. Suppose that instead of using term statistics, like BIM and BM25 do, we use user clicks to estimate probabilities, so that the documents with the most clicks get the highest probabilities of relevance. Documents could then be ranked in decreasing order of clicks. Have we created the perfect search engine? Why or why not?

No, because people tend to click on documents just because they are highly ranked (see part h of problem 1). If the system wasn't perfect in the first place, then we will just be reinforcing its imperfections.
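The claim in parts (b) through (d) can also be checked numerically on a small example. The sketch below is a brute-force sanity check, not a proof: it enumerates every permutation of a hypothetical list of relevance probabilities and confirms that the decreasing-probability ranking attains the maximum of $E[R_k]$ at every k. All function names and the example probabilities are made up for illustration.

```python
from itertools import permutations

def expected_relevant(probs, k):
    """E[R_k] from Eq. 1: sum of relevance probabilities at ranks 1..k."""
    return sum(probs[:k])

def expected_precision(probs, k):
    return expected_relevant(probs, k) / k

def expected_recall(probs, k, total_relevant):
    return expected_relevant(probs, k) / total_relevant

def prp_is_optimal(probs):
    """True if sorting by decreasing probability maximizes E[R_k] for every k,
    checked by brute force over all rankings (only feasible for short lists)."""
    ranked = sorted(probs, reverse=True)
    for k in range(1, len(probs) + 1):
        best = max(expected_relevant(p, k) for p in permutations(probs))
        if expected_relevant(ranked, k) < best - 1e-12:
            return False
    return True

probs = [0.1, 0.9, 0.4, 0.7]  # hypothetical P(R | D_i, Q) values
print(prp_is_optimal(probs))
```

Since k and R are constants for a given query and cutoff, the same ranking maximizes expected precision and expected recall as well.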

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Theory of Computations Spring 2016 Practice Final Exam Solutions

Theory of Computations Spring 2016 Practice Final Exam Solutions 1 of 8 Theory of Computations Spring 2016 Practice Final Exam Solutions Name: Directions: Answer the questions as well as you can. Partial credit will be given, so show your work where appropriate. Try

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014

CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 CMSC 476/676 Information Retrieval Midterm Exam Spring 2014 Name: You may consult your notes and/or your textbook. This is a 75 minute, in class exam. If there is information missing in any of the question

More information

(Refer Slide Time: 01.26)

(Refer Slide Time: 01.26) Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture # 22 Why Sorting? Today we are going to be looking at sorting.

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

Math 308 Autumn 2016 MIDTERM /18/2016

Math 308 Autumn 2016 MIDTERM /18/2016 Name: Math 38 Autumn 26 MIDTERM - 2 /8/26 Instructions: The exam is 9 pages long, including this title page. The number of points each problem is worth is listed after the problem number. The exam totals

More information

Query Answering Using Inverted Indexes

Query Answering Using Inverted Indexes Query Answering Using Inverted Indexes Inverted Indexes Query Brutus AND Calpurnia J. Pei: Information Retrieval and Web Search -- Query Answering Using Inverted Indexes 2 Document-at-a-time Evaluation

More information

CSE 494: Information Retrieval, Mining and Integration on the Internet

CSE 494: Information Retrieval, Mining and Integration on the Internet CSE 494: Information Retrieval, Mining and Integration on the Internet Midterm. 18 th Oct 2011 (Instructor: Subbarao Kambhampati) In-class Duration: Duration of the class 1hr 15min (75min) Total points:

More information

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)

Efficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488) Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-

More information

Midterm Exam Search Engines ( / ) October 20, 2015

Midterm Exam Search Engines ( / ) October 20, 2015 Student Name: Andrew ID: Seat Number: Midterm Exam Search Engines (11-442 / 11-642) October 20, 2015 Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation"

CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation CSCI 599: Applications of Natural Language Processing Information Retrieval Evaluation" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Evaluation" Evaluation is key to building

More information

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)"

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3) CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)" All slides Addison Wesley, Donald Metzler, and Anton Leuski, 2008, 2012! Language Model" Unigram language

More information

CSE332 Summer 2010: Final Exam

CSE332 Summer 2010: Final Exam CSE332 Summer 2010: Final Exam Closed notes, closed book; calculator ok. Read the instructions for each problem carefully before answering. Problems vary in point-values, difficulty and length, so you

More information

Document indexing, similarities and retrieval in large scale text collections

Document indexing, similarities and retrieval in large scale text collections Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

CSE 240A Midterm Exam

CSE 240A Midterm Exam Student ID Page 1 of 7 2011 Fall Professor Steven Swanson CSE 240A Midterm Exam Please write your name at the top of each page This is a close book, closed notes exam. No outside material may be used.

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

Static Pruning of Terms In Inverted Files

Static Pruning of Terms In Inverted Files In Inverted Files Roi Blanco and Álvaro Barreiro IRLab University of A Corunna, Spain 29th European Conference on Information Retrieval, Rome, 2007 Motivation : to reduce inverted files size with lossy

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators) Name: Email address: Quiz Section: CSE 332 Autumn 2013: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will

More information

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators) Name: Email address: Quiz Section: CSE 332 Spring 2013: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will

More information

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University CS6200 Information Retrieval David Smith College of Computer and Information Science Northeastern University Indexing Process!2 Indexes Storing document information for faster queries Indexes Index Compression

More information

CS 415 Midterm Exam Spring 2002

CS 415 Midterm Exam Spring 2002 CS 415 Midterm Exam Spring 2002 Name KEY Email Address Student ID # Pledge: This exam is closed note, closed book. Good Luck! Score Fortran Algol 60 Compilation Names, Bindings, Scope Functional Programming

More information

Chapter III.2: Basic ranking & evaluation measures

Chapter III.2: Basic ranking & evaluation measures Chapter III.2: Basic ranking & evaluation measures 1. TF-IDF and vector space model 1.1. Term frequency counting with TF-IDF 1.2. Documents and queries as vectors 2. Evaluating IR results 2.1. Evaluation

More information

ALGEBRA Sec. 5 IDENTITY AXIOMS. MathHands.com. IDENTITY AXIOMS: Identities

ALGEBRA Sec. 5 IDENTITY AXIOMS. MathHands.com. IDENTITY AXIOMS: Identities IDENTITY AXIOMS IDENTITY AXIOMS: Identities It is helpful to recall the definition of a binary operation. As we have stated it, a binary operation is a mixing recipe for mixing two items. We used the color

More information

p x i 1 i n x, y, z = 2 x 3 y 5 z

p x i 1 i n x, y, z = 2 x 3 y 5 z 3 Pairing and encoding functions Our aim in this part of the course is to show that register machines can compute everything that can be computed, and to show that there are things that can t be computed.

More information

Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming

Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming Fall 2006 University of California, Berkeley College of Engineering Computer Science Division EECS John Kubiatowicz Midterm II December 4 th, 2006 CS162: Operating Systems and Systems Programming Your

More information

Information Retrieval

Information Retrieval Information Retrieval Natural Language Processing: Lecture 12 30.11.2017 Kairit Sirts Homework 4 things that seemed to work Bidirectional LSTM instead of unidirectional Change LSTM activation to sigmoid

More information

Query Processing and Alternative Search Structures. Indexing common words

Query Processing and Alternative Search Structures. Indexing common words Query Processing and Alternative Search Structures CS 510 Winter 2007 1 Indexing common words What is the indexing overhead for a common term? I.e., does leaving out stopwords help? Consider a word such

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 Information Retrieval Lecture 6: Index Compression 6 Last Time: index construction Sort- based indexing Blocked Sort- Based Indexing Merge sort is effective

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 2: MapReduce Algorithm Design (2/2) January 14, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Retrieval Evaluation. Hongning Wang

Retrieval Evaluation. Hongning Wang Retrieval Evaluation Hongning Wang CS@UVa What we have learned so far Indexed corpus Crawler Ranking procedure Research attention Doc Analyzer Doc Rep (Index) Query Rep Feedback (Query) Evaluation User

More information

CS 373: Combinatorial Algorithms, Spring 1999

CS 373: Combinatorial Algorithms, Spring 1999 CS 373: Combinatorial Algorithms, Spring 1999 Final Exam (May 7, 1999) Name: Net ID: Alias: This is a closed-book, closed-notes exam! If you brought anything with you besides writing instruments and your

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Architecture and Implementation of Database Systems (Summer 2018)

Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner Architecture & Implementation of DBMS Summer 2018 1 Architecture and Implementation of Database Systems (Summer 2018) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Summer 2018 Jens

More information

CSE548, AMS542: Analysis of Algorithms, Fall 2012 Date: October 16. In-Class Midterm. ( 11:35 AM 12:50 PM : 75 Minutes )

CSE548, AMS542: Analysis of Algorithms, Fall 2012 Date: October 16. In-Class Midterm. ( 11:35 AM 12:50 PM : 75 Minutes ) CSE548, AMS542: Analysis of Algorithms, Fall 2012 Date: October 16 In-Class Midterm ( 11:35 AM 12:50 PM : 75 Minutes ) This exam will account for either 15% or 30% of your overall grade depending on your

More information

Sample questions with solutions Ekaterina Kochmar

Sample questions with solutions Ekaterina Kochmar Sample questions with solutions Ekaterina Kochmar May 27, 2017 Question 1 Suppose there is a movie rating website where User 1, User 2 and User 3 post their reviews. User 1 has written 30 positive (5-star

More information

Lecture 15. Error-free variable length schemes: Shannon-Fano code

Lecture 15. Error-free variable length schemes: Shannon-Fano code Lecture 15 Agenda for the lecture Bounds for L(X) Error-free variable length schemes: Shannon-Fano code 15.1 Optimal length nonsingular code While we do not know L(X), it is easy to specify a nonsingular

More information

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators)

CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators) Name: Email address: Quiz Section: CSE 332 Spring 2014: Midterm Exam (closed book, closed notes, no calculators) Instructions: Read the directions for each question carefully before answering. We will

More information

Text Retrieval an introduction

Text Retrieval an introduction Text Retrieval an introduction Michalis Vazirgiannis Nov. 2012 Outline Document collection preprocessing Feature Selection Indexing Query processing & Ranking Text representation for Information Retrieval

More information

Lecture Programming in C++ PART 1. By Assistant Professor Dr. Ali Kattan

Lecture Programming in C++ PART 1. By Assistant Professor Dr. Ali Kattan Lecture 08-1 Programming in C++ PART 1 By Assistant Professor Dr. Ali Kattan 1 The Conditional Operator The conditional operator is similar to the if..else statement but has a shorter format. This is useful

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze)

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 3 Dictionaries and Tolerant retrieval Chapter 4 Index construction Chapter 5 Index compression Content Dictionary data structures

More information

CMPSCI 145 MIDTERM #2 SOLUTION KEY SPRING 2015 April 3, 2015 Professor William T. Verts

CMPSCI 145 MIDTERM #2 SOLUTION KEY SPRING 2015 April 3, 2015 Professor William T. Verts CMPSCI 145 MIDTERM #2 SOLUTION KEY SPRING 2015 April 3, 2015 Page 1 15 Points Answer 15 of the following problems (1 point each). Answer more than 15 for extra credit. Incorrect or blank answers will

More information

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM

CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM CANDIDATE LINK GENERATION USING SEMANTIC PHEROMONE SWARM Ms.Susan Geethu.D.K 1, Ms. R.Subha 2, Dr.S.Palaniswami 3 1, 2 Assistant Professor 1,2 Department of Computer Science and Engineering, Sri Krishna
