CMSC 476/676 Information Retrieval Midterm Exam Spring 2014


Name: ____________________

You may consult your notes and/or your textbook. This is a 75-minute, in-class exam. If there is information missing in any of the question formulations, state your assumptions and proceed with your answer. If you decide to quote from the textbook, give the page number of your quotation. Answers written in your own words may carry more weight than quotations from the textbook. The exam questions total 100 points.


(30 points) The following True/False questions are worth 2 points each. Please circle either T or F.

1. T F The vector space model of IR assumes that the order in which terms occur in a document is not important for retrieval.

2. T F Stoplist processing is sometimes used to reduce run-time storage requirements.

3. T F The probabilistic model of retrieval cannot be used unless the probabilities of terms occurring in the documents are known in advance.

4. T F In a retrieval system using the vector space model, stemming tends to decrease recall.

5. T F In a retrieval system that uses character n-grams rather than words, stemming is unnecessary due to the larger term space.

6. T F There is a principle of duality in IR, in the sense that queries can be regarded as just another type of document, and documents can be used as queries.

7. T F Compression of entries in the term-document matrix can be used to reduce run-time storage and execution-time requirements.

8. T F N-grams are useful in processing documents which may have been subject to typos and OCR errors.

9. T F Once the documents in a collection have been indexed, it sometimes makes sense to compress them until they're needed in response to a user's query.

10. T F The entropy of a string is related to the extent to which that string can be compressed.

11. T F Latent Semantic Analysis makes it possible to retrieve relevant documents even if those documents have no terms in common with the query.

12. T F The BM25 ranking formula is based on relaxing some assumptions in the vector space model.

13. T F Document length normalization was intended to avoid the problem of short documents being ranked too highly.

14. T F The concept of a scored corpus refers to a set of documents, a set of queries, and a set of terms and their corresponding weights.

15. T F In general, as documents in a result set are listed in descending order of estimated relevance, precision goes down as recall goes up.
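
For reference, several of the statements above concern the vector space model. The following is a minimal sketch of bag-of-words cosine similarity, assuming whitespace tokenization and raw term counts; the function name and example strings are illustrative only.

    from collections import Counter
    import math

    def cosine_similarity(text_a, text_b):
        # Bag-of-words vectors: term order is discarded (cf. question 1).
        a = Counter(text_a.lower().split())
        b = Counter(text_b.lower().split())
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # A query is scored like any other document (cf. question 6).
    print(cosine_similarity("information retrieval exam", "retrieval of information"))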

Short answer questions

A. (10 points) In a term weight calculation such as tf.idf, we sometimes use the logarithm of some quantity instead of the quantity itself. Give two examples of when or why this would be a good idea.

B. (10 points) When processing a user query, which will then be used to calculate similarity scores between the query vector and a set of document vectors, we might assume that the terms in the query are of equal weight. We might instead apply the same term weighting formula to the query as was used for the documents. Describe one advantage and one disadvantage of this approach.
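
As background for questions A and B, here is a minimal sketch of one common log-damped tf.idf variant; the exact weighting formula used in class may differ, and the numbers in the example are illustrative.

    import math

    def tfidf(term_freq, doc_freq, num_docs):
        # Log damping keeps a term occurring 100 times from counting
        # 100 times as much as a term occurring once (cf. question A).
        tf = 1 + math.log(term_freq) if term_freq > 0 else 0.0
        idf = math.log(num_docs / doc_freq)
        return tf * idf

    # Example: a term occurring 5 times in a document, in 100 of 10,000 documents.
    print(tfidf(5, 100, 10000))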

C. (10 points) Consider the case of a query term t that is not in the set of indexed terms for a corpus. That is, although the term t may occur in the collection, t is not represented in the vector space created by that collection. Aside from ignoring the term t, how might the vector space representation be adapted to handle this situation? Describe any obvious advantages or disadvantages of this adaptation.

D. (10 points) Suppose we have a corpus in which Zipf's Law holds. If the most frequent term occurs one million times, and the next most frequent term occurs 250,000 times, how often should we expect to see the third most frequent term? And the fourth most frequent term?
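
As a reference for question D, one common generalized form of Zipf's law (the classic statement fixes the exponent s at 1) is

    f(r) = C / r^s

where f(r) is the frequency of the r-th most frequent term and C is a constant. The ratio of the two highest frequencies determines s, since f(1)/f(2) = 2^s, i.e. s = log(f(1)/f(2)) / log 2; the expected lower-rank frequencies then follow from f(r) = f(1) / r^s.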

E. (10 points) Calculate the Levenshtein distance between the words "rose" and "rice". To do so, fill in the appropriate (24) blanks in this table:

          R     I     C     E
    0     ___   ___   ___   ___
R   ___   ___   ___   ___   ___
O   ___   ___   ___   ___   ___
S   ___   ___   ___   ___   ___
E   ___   ___   ___   ___   ___

F. (10 points) In the evaluation of a question answering system, why might the traditional recall measure be unsatisfactory? Describe a measure that might be used instead, and explain why it may be more satisfactory.

G. (10 points) Suppose we have a large, mostly unknown collection. By inspection of the first million tokens, we find 30,000 distinct terms. About how many terms would we expect to find in the first 100 million tokens? Why?
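
For question E, here is a minimal sketch of the standard Levenshtein dynamic program, assuming unit costs for insertion, deletion, and substitution; variable names are illustrative.

    def levenshtein(source, target):
        m, n = len(source), len(target)
        # dist[i][j] = edit distance between source[:i] and target[:j]
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i          # delete all of source[:i]
        for j in range(n + 1):
            dist[0][j] = j          # insert all of target[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if source[i - 1] == target[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        return dist[m][n]

    print(levenshtein("rose", "rice"))

For question G, the usual starting point is Heaps' law, V = k * n^b, which relates vocabulary size V to collection size n in tokens; the exponent b, often taken to be about 0.5, is an empirical assumption.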