University of Virginia
Department of Computer Science
CS 4501: Information Retrieval, Fall 2015
2:00pm-3:30pm, Tuesday, December 15th

Name:                         Computing ID:

This is a closed book and closed notes exam. No electronic aids or cheat sheets are allowed. There are 9 pages, 4 parts of questions (the last part is bonus questions), and 110 total points in this exam. The questions are printed on both sides! Please carefully read the instructions and questions before you answer them. Please pay special attention to your handwriting; if an answer is not recognizable by the instructor, the grading might be inaccurate (NO argument about this after the grading is done). Try to keep your answers as concise as possible; grading is NOT by keyword matching.

Total: /110

Academic Integrity Agreement

I, the undersigned, have neither witnessed nor received any external help while taking this exam. I understand that doing so (and not reporting it) is a violation of the University's academic integrity policies, and may result in academic sanctions.

Signature:

Your exam will not be graded unless the above agreement is signed.

1 True/False Questions (12 × 2 pts)

Please choose either True or False for each of the following statements. For each statement you believe is False, please give a brief explanation (you do not need to explain when you believe it is True). Two points for each question. Note: credit can only be granted if your explanation for a false statement is correct.

1. Depth-first crawling is preferred in vertical search engines.
False, and Explain: focused crawling should be applied, since the links will not necessarily point only to pages in the same topical domain.

2. Smoothing of a language model is not needed if all the query terms occur in a document.
False, and Explain: smoothing has to be applied to all the documents during search, otherwise the query likelihoods will not be comparable (a small numeric illustration follows this section). Alternatively: unless all the query terms occur in all the documents, we still need smoothing.

3. Pairwise learning to rank algorithms are preferred over pointwise algorithms because they can directly optimize ranking-related metrics, e.g., MAP or NDCG.
False, and Explain: pairwise learning to rank algorithms only minimize the number of misordered pairs, which does not directly optimize ranking-related metrics.

4. In query generation models, we have assumed words in documents are independent.
False, and Explain: we do not make such an assumption in query generation models at all; e.g., we can use N-gram language models.

5. Relevance feedback helps a document generation model to improve its estimation for the current query and the future document.
False, and Explain: it helps a document generation model to improve its estimation for the future query and the current document.

6. The PageRank score can be computed at the crawling stage.
False, and Explain: we need sufficient observations of the link graph before we can compute PageRank.

7. An interleaved test is more sensitive than an A/B test.
True.

8. The maximum a posteriori estimation method for parameter estimation is problematic when one has only a small amount of observations.
False, and Explain: maximum a posteriori estimation is preferred when one has only a handful of observations.

9. Three important heuristics in vector space ranking models are: term frequency, document frequency and query term frequency.
False, and Explain: they should be term frequency, document frequency and document length normalization.

10. Because of position bias, result clicks can only be treated as implicit feedback.
False, and Explain: it is because we do not know the exact reason why a user clicks on a result, not because of position bias, that clicks can only be treated as implicit feedback.

11. The current human-annotation-based IR evaluation is biased, since we cannot exhaustively annotate all relevant documents.
False, and Explain: it can still largely reveal the relative comparison between ranking algorithms.

12. The initial state of the random walk in the PageRank algorithm determines the estimation quality of its stationary distribution.
False, and Explain: the stationary distribution of PageRank scores depends only on the transition matrix and is independent of the initial state of the random walk.
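To make the explanation for statement 2 concrete, here is a minimal sketch (not part of the exam) that scores the same document with and without smoothing. The two-document toy corpus, the query, and the Jelinek-Mercer weight LAMBDA are invented for illustration; the point is only that unsmoothed (MLE) and smoothed query likelihoods live on different scales, so mixing the two policies across documents would not be an apples-to-apples comparison.

```python
# Toy illustration of statement 2: compare the unsmoothed and smoothed
# query likelihood of a document that contains every query term.
from collections import Counter
from math import log

docs = {
    "d1": "information retrieval systems store and retrieve information".split(),
    "d2": "modern web search engines rank web documents".split(),
}
query = "information retrieval".split()
collection = Counter(w for words in docs.values() for w in words)
coll_len = sum(collection.values())
LAMBDA = 0.9  # weight on the document model (Jelinek-Mercer smoothing)

def log_likelihood(doc_words, smooth):
    tf, dl = Counter(doc_words), len(doc_words)
    score = 0.0
    for w in query:
        p_doc = tf[w] / dl
        p_col = collection[w] / coll_len
        p = LAMBDA * p_doc + (1 - LAMBDA) * p_col if smooth else p_doc
        score += log(p) if p > 0 else float("-inf")
    return score

print(log_likelihood(docs["d1"], smooth=False))  # MLE score (all query terms present)
print(log_likelihood(docs["d1"], smooth=True))   # smoothed score for the same document
# The two numbers differ, so ranking one unsmoothed document against other
# smoothed documents would compare scores on different scales.
```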

2 Short Answer Questions (32 pts)

Most of the following questions can be answered in one or two sentences. Please make your answers concise and to the point.

1. For the queries below, can we still run through the inverted index join in time O(m+n), where m and n are the lengths of the postings lists for "information" and "retrieval"? If not, what can we achieve? (4pts)

information AND NOT retrieval: Yes, this is the difference between the postings lists of "information" and "retrieval" (a code sketch follows this section).

information OR NOT retrieval: No, this essentially has to go over all the documents in the index, i.e., the postings lists of "information" and of all other words except "retrieval".

2. How does Zipf's law ensure effective inverted index compression? (4pts)

By storing the gaps between consecutive document IDs, Zipf's law ensures that: 1) for frequent words, there will be many duplicated small gaps; 2) for infrequent words, the postings list is short. Therefore, by encoding the gaps with variable-length codes, this redundancy can be exploited for effective compression.

3. List three different types of click heuristics that generate pairwise result preference assertions. Given the click log of a user's search query, write out a pairwise assertion for each of those heuristics. (6pts)

Search log: $\{d_1^{(3)}, d_2, d_3^{(1)}, d_4^{(2)}, d_5, d_6^{(4)}, d_7, d_8, d_9, d_{10}\}$, where the subscript indicates the document ID and the superscript $(\cdot)$ represents the click order.

Click > Skip Above: $d_4^{(2)} > d_2$
Last Click > Skip Above: $d_6^{(4)} > d_2$
Click > Skip Previous: $d_3^{(1)} > d_2$

4. List three different components/algorithms in a typical information retrieval system where sequential browsing has been assumed. (6pts)

Search result ranking, i.e., rank documents in descending order of their probability of being relevant.
Search evaluation, i.e., higher weights are given to documents ranked at higher positions.
Click modeling, i.e., examination events are assumed to proceed sequentially from top to bottom.

5. List three different behavior-based ranking metrics that measure a retrieval system's search result quality based on users' interactive behaviors. Please also indicate their correlation with search quality, e.g., whether an increased value corresponds to improved ranking performance, or the other way around. (6pts)

Abandonment rate: an increased value indicates worse ranking performance.
Reformulation rate: an increased value indicates worse ranking performance.
Clicks-per-query rate: a decreased value indicates worse ranking performance.

6. Fill in the blanks to finish the derivation of the ranking formula for the query generation model. (6pts)

$$
\begin{aligned}
\log p(q|d) &= \sum_{w \in q} c(w,q)\,\log p(w|d) \\
&= \sum_{w \in q \cap d} c(w,q)\,\log p_{\text{seen}}(w|d) + \sum_{w \in q,\, w \notin d} c(w,q)\,\log \alpha_d\, p(w|C) \\
&= \sum_{w \in q \cap d} c(w,q)\,\log p_{\text{seen}}(w|d) + \sum_{w \in q} c(w,q)\,\log \alpha_d\, p(w|C) - \sum_{w \in q \cap d} c(w,q)\,\log \alpha_d\, p(w|C) \\
&= \sum_{w \in q \cap d} c(w,q)\,\log \frac{p_{\text{seen}}(w|d)}{\alpha_d\, p(w|C)} + |q| \log \alpha_d + \sum_{w \in q} c(w,q)\,\log p(w|C)
\end{aligned}
$$
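The following is a minimal sketch, with hypothetical postings lists, of the linear merge behind question 1: "information AND NOT retrieval" is the difference of two sorted postings lists, computed in a single pass, so the cost stays O(m+n).

```python
# Linear-merge difference of two sorted postings lists: each pointer only
# moves forward, so the total work is O(m + n).
def and_not(postings_a, postings_b):
    """Return doc IDs in postings_a but not in postings_b.
    Both inputs must be sorted lists of document IDs."""
    result, i, j = [], 0, 0
    while i < len(postings_a):
        if j == len(postings_b) or postings_a[i] < postings_b[j]:
            result.append(postings_a[i])  # doc contains 'information' only
            i += 1
        elif postings_a[i] == postings_b[j]:
            i += 1                        # doc contains both terms: skip it
            j += 1
        else:
            j += 1                        # advance the 'retrieval' list
    return result

information = [1, 3, 5, 8, 13, 21]   # hypothetical postings list
retrieval   = [2, 3, 8, 9, 21, 34]
print(and_not(information, retrieval))  # [1, 5, 13]
```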

3 Essay Questions (44 pts)

All the following questions focus on system/algorithm design. Please think about all the methods and concepts we have discussed in class (including those from the students' paper presentations) and try to give your best designs in terms of feasibility, comprehensiveness and novelty. When necessary, you can draw diagrams or write pseudo code to illustrate your idea.

1. Given the following four documents in our archive, where the first two documents are stored on machine one and the last two documents are stored on machine two, draw the procedure of MapReduce-based inverted index construction and the resulting inverted index. (16 pts)

Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise

Step 1. Parse & Count (tuples are <term, docID, count>)
Machine one: <new, 1, 1> <home, 1, 1> <sales, 1, 1> <top, 1, 1> <forecasts, 1, 1> <home, 2, 1> <sales, 2, 1> <rise, 2, 1> <in, 2, 1> <july, 2, 1>
Machine two: <increase, 3, 1> <in, 3, 2> <home, 3, 1> <sales, 3, 1> <july, 3, 1> <july, 4, 1> <new, 4, 1> <home, 4, 1> <sales, 4, 1> <rise, 4, 1>

Step 2. Local Sort
Machine one: <forecasts, 1, 1> <home, 1, 1> <home, 2, 1> <in, 2, 1> <july, 2, 1> <new, 1, 1> <rise, 2, 1> <sales, 1, 1> <sales, 2, 1> <top, 1, 1>
Machine two: <home, 3, 1> <home, 4, 1> <in, 3, 2> <increase, 3, 1> <july, 3, 1> <july, 4, 1> <new, 4, 1> <rise, 4, 1> <sales, 3, 1> <sales, 4, 1>

Step 3. Merge Sort
<forecasts, 1, 1> <home, 1, 1> <home, 2, 1> <home, 3, 1> <home, 4, 1> <in, 2, 1> <in, 3, 2> <increase, 3, 1> <july, 2, 1> <july, 3, 1> <july, 4, 1> <new, 1, 1> <new, 4, 1> <rise, 2, 1> <rise, 4, 1> <sales, 1, 1> <sales, 2, 1> <sales, 3, 1> <sales, 4, 1> <top, 1, 1>

Step 4. Form Inverted Index (the first tuple in each postings list is <docID, count>; the remaining tuples are <gap, count>)
forecasts: <1,1>
home: <1,1> <1,1> <1,1> <1,1>
in: <2,1> <1,2>
increase: <3,1>
july: <2,1> <1,1> <1,1>
new: <1,1> <3,1>
rise: <2,1> <2,1>
sales: <1,1> <1,1> <1,1> <1,1>
top: <1,1>

We will not penalize you if your resulting inverted index stores document IDs rather than gaps between document IDs.
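As a companion to the answer above, here is a minimal single-process sketch that mimics the four steps (parse & count per machine, local sort, merge, gap-encoded postings). The helper names and the use of Python's heapq.merge stand in for the shuffle/merge phase and are illustrative choices, not part of the expected exam answer.

```python
# Mimic MapReduce-style inverted index construction for the four documents.
from collections import Counter
from heapq import merge

machine_one = {1: "new home sales top forecasts",
               2: "home sales rise in july"}
machine_two = {3: "increase in home sales in july",
               4: "july new home sales rise"}

def parse_and_sort(docs):
    """Steps 1-2: emit <term, docID, count> tuples, then sort locally."""
    tuples = [(term, doc_id, count)
              for doc_id, text in docs.items()
              for term, count in Counter(text.split()).items()]
    return sorted(tuples)

# Step 3: merge the two locally sorted runs.
run_one, run_two = parse_and_sort(machine_one), parse_and_sort(machine_two)
merged = list(merge(run_one, run_two))

# Step 4: build gap-encoded postings lists <docID, count>, <gap, count>, ...
index = {}
for term, doc_id, count in merged:
    postings = index.setdefault(term, [])
    prev_doc = sum(p[0] for p in postings)  # recover the last docID from the gaps
    postings.append((doc_id - prev_doc if postings else doc_id, count))

for term in sorted(index):
    print(term, index[term])
# e.g. home [(1, 1), (1, 1), (1, 1), (1, 1)] and rise [(2, 1), (2, 1)]
```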

2. Given the following hyperlink structure among four web pages, demonstrate the step-by-step construction of the transition matrix for PageRank with damping factor α = 0.1. You are suggested to write down the intermediate results. (12 pts)

[Figure: hyperlink graph over pages 1, 2, 3 and 4]

1. Adjacency matrix:
0 1 0 1
0 0 0 0
0 0 0 1
0 1 1 0

2. Enable random jump on the dead end (row 2):
0 1 0 1
0.25 0.25 0.25 0.25
0 0 0 1
0 1 1 0

3. Row normalization:
0 0.5 0 0.5
0.25 0.25 0.25 0.25
0 0 0 1
0 0.5 0.5 0

4. Enable random jump on all nodes:

0.1 ×
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
0.25 0.25 0.25 0.25
+ 0.9 ×
0 0.5 0 0.5
0.25 0.25 0.25 0.25
0 0 0 1
0 0.5 0.5 0
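A minimal NumPy sketch of the same four steps, assuming the standard convention that the damping factor α = 0.1 is the probability of taking a random jump; it starts from the adjacency matrix read off the hyperlink figure and prints the final transition matrix.

```python
# Build the PageRank transition matrix step by step.
import numpy as np

alpha = 0.1
A = np.array([[0, 1, 0, 1],     # step 1: adjacency matrix (page i links to page j)
              [0, 0, 0, 0],     # page 2 is a dead end
              [0, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = A.shape[0]

# Step 2: replace dead-end rows with uniform random jumps.
dead_ends = A.sum(axis=1) == 0
A[dead_ends] = 1.0 / n

# Step 3: normalize each row so it sums to one.
M = A / A.sum(axis=1, keepdims=True)

# Step 4: mix in a random jump from every node.
T = alpha * np.full((n, n), 1.0 / n) + (1 - alpha) * M
print(np.round(T, 3))   # final transition matrix; every row sums to 1
```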

3. Bing wants to become a better personalized search engine. Combining the concepts and techniques that we have learned this semester, please give your concrete suggestions of where and how an IR system can be personalized. Please cover at least three components in a typical retrieval system, e.g., the query processing module, ranking functions, and feedback modeling. (16 pts)

In the query processing module, we can add personalized query expansion or query topic classification based on the user's query history. E.g., if a user often searches for programming languages, when he/she issues the query "java", we can automatically append keywords like "programming language" to it (with relatively lower weight) and associate it with the topical categories "information technology" and "programming language".

In the ranking module, we can build a personalized user profile based on Rocchio feedback or query-language-model-based feedback techniques. At query time, we can rerank the initial search results with such a user profile (a sketch of profile-based reranking follows this question). Building different learning-to-rank models to get personalized ranking feature combinations is also feasible.

In the feedback module, based on different users' click behavior, we can build different examination models, e.g., some users tend to examine more results before clicking, while others tend to open more links and read them one by one. These patterns indicate different relevance quality of the clicked results.

In the result display module, we can also adapt to individual users' preferences. For example, some users might prefer the image vertical to be displayed before the video vertical, while others might prefer to read more text content. Such display settings can also be personalized accordingly.
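As a sketch of the Rocchio-style profile reranking suggested for the ranking module: the document vectors, retrieval scores, and the mixing parameter beta below are illustrative assumptions, not a prescribed design.

```python
# Accumulate a user profile from clicked documents, then rerank results by
# mixing the original retrieval score with cosine similarity to the profile.
import numpy as np

def rocchio_profile(clicked_doc_vectors, alpha=1.0):
    """Average the (e.g., TF-IDF) vectors of previously clicked documents."""
    return alpha * np.mean(clicked_doc_vectors, axis=0)

def rerank(results, profile, beta=0.3):
    """results: list of (doc_id, retrieval_score, doc_vector)."""
    def personalized_score(item):
        doc_id, score, vec = item
        cos = vec @ profile / (np.linalg.norm(vec) * np.linalg.norm(profile) + 1e-9)
        return (1 - beta) * score + beta * cos
    return sorted(results, key=personalized_score, reverse=True)

# Toy 3-term vocabulary; this user has clicked programming-related documents.
profile = rocchio_profile([np.array([0.9, 0.1, 0.0]),
                           np.array([0.8, 0.2, 0.0])])
results = [("d_coffee", 0.62, np.array([0.0, 0.1, 0.9])),
           ("d_java_lang", 0.60, np.array([0.9, 0.1, 0.1]))]
print([doc_id for doc_id, _, _ in rerank(results, profile)])
# ['d_java_lang', 'd_coffee'] -- the programming page moves up for this user
```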

4 Bonus Questions (10 pts) All these questions are supposed to be open research questions. Your answers have to be very specific to convince the instructor that you deserve the bonus (generally mention some broad concepts will not count). 1. How to perform better mobile search? Credit is given based on the novelty of your proposal. (10pts) This is an open research question, and I am expecting some new ideas that are beyond what we have discussed in class, e.g., using locations and contacts. 10