
Student Name:                Andrew ID:                Seat Number:

Midterm Exam
Search Engines (11-442 / 11-642)
October 20, 2015

Answer all of the following questions. Each answer should be thorough, complete, and relevant. Points will be deducted for irrelevant details. Use the back of the pages if you need more room for your answer. Calculators, phones, and other computational devices are not permitted. If a calculation is required, write fractions, and show your work so that it is clear that you know how to do the calculation.

Advice about exam answers: Sometimes an answer says "I would use <technique> to do <x>". That answer shows that you remember a name, but it does not show that you remember how the technique works, or why it is the right tool for this problem. Give a brief description of how the technique works and why it is the right tool for this job. If the technique needs other information, explain where the information comes from.

1 Evaluation

Suppose that a large health provider has a website with a search engine that allows patients to find information about staying fit, eating well, diseases, treatments, and tests. The search engine receives about 35 queries per week. The company doesn't have data for evaluating the accuracy of the search engine, and doesn't know its accuracy. Describe how you would evaluate the accuracy of the search engine. Be clear about the method you would use, the data it would require, how much data it would require, and how you would get the data. Explain why your method is the right choice for this problem. [15 points]

Answer: The search engine doesn't receive much traffic, so there isn't enough click data, and there isn't enough traffic to use interleaved testing. The Cranfield methodology is the best choice in this situation. Start by randomly sampling 50-100 queries from the query log. Develop written information needs that describe what each query is about. If possible, index the documents with several open-source search engines, run each query against each search engine, and pool the results for each query to form a pool of documents to be assessed. If it is not possible to use several open-source search engines, then create several (e.g., 5) variants of each query, run each variant against the search engine, and pool the results to form a pool of documents to be assessed. The size of the pools is determined by the available budget and the nature of the problem; the top 100 documents are probably sufficient for this task, because most website visitors won't look very deeply into the results. Sort the pool of documents for each query into a random order. Have someone assess the results, either on a binary scale (relevant vs. non-relevant) or on a 5-point scale. Use several metrics (P@n, MAP, NDCG) to assess the search engine's quality.
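The metrics named in the answer (P@n, MAP, NDCG) can be computed directly from the assessors' judgments once the pools are judged. The following is a minimal sketch in Python; the function names, the toy ranking, and the use of graded judgments for NDCG are illustrative assumptions, not part of the exam answer.

```python
import math

def precision_at_n(ranking, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    return sum(1 for d in ranking[:n] if d in relevant) / n

def average_precision(ranking, relevant):
    """Mean of P@k over the ranks k at which a relevant document appears.
    MAP is the mean of this value over all evaluation queries."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def ndcg_at_n(ranking, gains, n):
    """NDCG@n using graded judgments (e.g., 0-4 on a 5-point scale)."""
    dcg = sum(gains.get(d, 0) / math.log2(k + 1)
              for k, d in enumerate(ranking[:n], start=1))
    ideal = sorted(gains.values(), reverse=True)[:n]
    idcg = sum(g / math.log2(k + 1) for k, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: one query's ranking and its judgments.
ranking = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d3", "d9", "d5"}             # binary judgments
gains = {"d3": 3, "d9": 2, "d5": 4}       # graded judgments
print(precision_at_n(ranking, relevant, 5))  # 0.4
print(average_precision(ranking, relevant))  # (1/1 + 2/4) / 3 = 0.5
print(ndcg_at_n(ranking, gains, 5))
```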

2 Document Structure

Two queries and two structured documents are shown below. For each query, explain i) what the query ranks; ii) whether information from different paragraphs is combined, and if so, how; iii) how scores are calculated (do not write formulas); and iv) whether it gives a higher score to Document 1 or Document 2, and why (assume that they have the same overall length). [16 points]

Query 1: #and (silicon.paragraph valley.paragraph bubble.paragraph)
Query 2: #and[paragraph] (silicon valley bubble)

Document 1
title: My Favorite Bubbles and Valleys
paragraph 1: An Irreverent Journey of Faith happens on Bubble girl.
paragraph 2: My leakiness is all the more evident against the shriveled, brown autumn landscape of Valley, and There are many minerals around their hometown, silicon, iron, silicon gulch, bubble, bubble, toil and trouble, crossing the valley of death.

Document 2
title: Is it Finally Over?
paragraph 1: There are 4 Signs That Silicon Valley's Tech Bubble Is Bursting...
paragraph 2: Tourists are taking the bus out of Silicon Valley. One of the big reasons for the blowing up of the latest bubble has been big money tourists.
paragraph 3: proliferation of unicorns in Silicon Valley is another sign of Bubble.

Answer: Query 1 ranks documents. Query 2 ranks paragraphs. Query 1 combines the information from all paragraph elements into a single inverted list. Then it uses the default document-scoring algorithm (e.g., MLE with two-stage smoothing) to calculate a score for the document. Query 2 does not combine information from different paragraphs. Instead, a score is calculated for each individual paragraph element using a mixture-model scoring algorithm that is a weighted average of the MLE score for the element, document-level smoothing, and corpus-level smoothing. Query 1 ranks Document 2 ahead of Document 1 because Document 2 has more occurrences of "silicon"; the two documents have the same number of other query terms. Query 2 doesn't rank documents.
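A minimal sketch of the two scoring strategies the answer describes: pooling all paragraph fields into one document representation (Query 1) versus scoring each paragraph with a mixture of paragraph, document, and corpus language models (Query 2). The simple linear mixture and its weights are illustrative assumptions that stand in for the two-stage smoothing a real engine would use.

```python
from collections import Counter

def mixture_score(query, elem_tf, elem_len, doc_tf, doc_len, corpus_tf, corpus_len,
                  w_elem=0.5, w_doc=0.3, w_corpus=0.2):
    """Score one element: weighted average of element, document, and corpus MLEs.
    The weights are illustrative, not the course's two-stage smoothing parameters."""
    score = 1.0
    for t in query:
        p = (w_elem * elem_tf[t] / elem_len
             + w_doc * doc_tf[t] / doc_len
             + w_corpus * corpus_tf[t] / corpus_len)
        score *= p
    return score

def query1_style(query, paragraphs, corpus_tf, corpus_len):
    """Query 1: merge all paragraph fields into one bag, then score the document."""
    doc_tf = Counter(w for p in paragraphs for w in p)
    doc_len = sum(doc_tf.values())
    return mixture_score(query, doc_tf, doc_len, doc_tf, doc_len,
                         corpus_tf, corpus_len)

def query2_style(query, paragraphs, corpus_tf, corpus_len):
    """Query 2: score each paragraph separately; returns one score per paragraph."""
    doc_tf = Counter(w for p in paragraphs for w in p)
    doc_len = sum(doc_tf.values())
    return [mixture_score(query, Counter(p), len(p), doc_tf, doc_len,
                          corpus_tf, corpus_len)
            for p in paragraphs]

# Toy example with pre-tokenized, lowercased paragraphs.
paras = [["silicon", "valley", "bubble"], ["tourists", "silicon", "valley"]]
corpus_tf = Counter(w for p in paras for w in p)
corpus_len = sum(corpus_tf.values())
q = ["silicon", "valley", "bubble"]
print(query1_style(q, paras, corpus_tf, corpus_len))
print(query2_style(q, paras, corpus_tf, corpus_len))
```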

3 Retrieval Models

Inverse document frequency (idf) and the Robertson-Sparck Jones (RSJ) weight are used to accomplish a similar purpose. Write both formulas (and define your variables). Explain their purpose (i.e., why their behavior is desired or important). Describe how their behavior differs. For example, if you replaced the BM25 RSJ weight with idf, what would be the effect on the document ranking algorithm? [15 points]

Answer:

idf(t): Any of the following formulas are acceptable:

    idf_t = log(N / df_t)        or        idf_t = log(N / df_t) + 1

RSJ(t): Both of the following formulas are acceptable:

    RSJ_t = log((N - df_t + 0.5) / (df_t + 0.5))        or        RSJ_t = MAX(0, log((N - df_t + 0.5) / (df_t + 0.5)))

where N is the number of documents in the corpus and df_t is the number of documents that contain t.

Both formulas favor words that are rare in the corpus. Rare words are better able to discriminate relevant documents from non-relevant documents. The RSJ weight has a stronger reward for rare terms and a stronger penalty for frequent terms: words that occur in more than half of the documents in the corpus get a weight of 0 (or less). If the RSJ weight is replaced with idf, the penalty for frequent words is reduced, especially for words that occur in more than half of the documents.
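A small numeric comparison makes the difference in behavior concrete. This sketch simply evaluates the two formulas above for a few document frequencies; the corpus size N = 1,000,000 and the use of natural log are arbitrary illustrative choices.

```python
import math

N = 1_000_000  # number of documents in the corpus (illustrative)

def idf(df):
    return math.log(N / df)

def rsj(df):
    return max(0.0, math.log((N - df + 0.5) / (df + 0.5)))

# Rare terms get similar weights from both formulas;
# frequent terms are penalized much more strongly by RSJ.
for df in (10, 1_000, 100_000, 600_000):
    print(f"df={df:>7}  idf={idf(df):6.2f}  rsj={rsj(df):6.2f}")
# At df=600,000 (more than half of the corpus) idf is still positive,
# but the RSJ weight is clamped to 0.
```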

4 Document Representation

Search engines use shallow language analysis and heuristics to convert a document to a bag-of-words (index terms). Why is lexical processing critical to the performance of search engines? Describe four lexical processing methods. For each method, give examples to show its advantages and disadvantages, or the tradeoff of using this method. [16 points]

Answer: Lexical processing transforms a document from natural language to a representation of meaning that a search engine can process quickly when a query is received. This transformation must produce a representation that is simpler, but that also retains all of the important meaning of the document. Poor accuracy is often due to a poor text representation.

Types of lexical processing (see the sketch after this list):

- Case conversion: "Apple" -> "apple". Advantages: Improves recall; matches more queries. Disadvantage: "Apple" may be a company name while "apple" is a kind of fruit, and case conversion loses that distinction.
- Stemming: "apples" -> "apple". Advantages: Improves recall; matches lexical variants that have the same meaning. Disadvantage: May conflate words that a person would consider different (e.g., "execute" and "executive") or different in a particular situation (e.g., the company "Apple" and "apples").
- Stopword removal: "the", "of", "a". Advantages: Discards low-content words, reduces index size, and can improve accuracy. Disadvantage: queries such as "to be or not to be" become difficult to satisfy after removal. If disk space is not a limit, stopwords can be kept in the index, with a rule such as: if stopwords make up more than half of the query terms, keep them when matching the query.
- De-compounding: "computer-virus" -> "computer", "virus". Advantages: Provides a more accurate representation of the document and enables a broader range of queries to match. Disadvantage: phrases such as "roe v. wade" become meaningless when split apart.

Other acceptable answers include phrase indexing and part-of-speech (POS) tagging.
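A minimal sketch of case conversion, stopword removal, and a crude suffix-stripping stemmer, just to show where each step fits in the tokenization pipeline. The tiny stopword list and the suffix rules are illustrative assumptions; a real system would use a standard stemmer such as Porter or Krovetz.

```python
import re

STOPWORDS = {"the", "of", "a", "to", "be", "or", "not"}  # illustrative subset

def crude_stem(word):
    """Very rough suffix stripping, for illustration only (not Porter/Krovetz)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def lexical_process(text, remove_stopwords=True):
    # Case conversion + simple tokenization (splits hyphenated compounds too).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(lexical_process("Apple ships new computer-virus scanners"))
# ['apple', 'ship', 'new', 'computer', 'viru', 'scanner']
# The hyphenated compound is split, and the crude stemmer mangles
# 'virus' -> 'viru': an example of the stemming tradeoff described above.
```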

5 Relevance Feedback

What is relevance feedback? Provide two reasons that it is not used in web search engines. Describe a task or environment where relevance feedback could be expected to be used effectively. [14 points]

Answer: Relevance feedback is a supervised machine learning approach to creating improved queries. It learns a larger bag-of-words and term weights from example (judged) documents. It is not used much in web search engines because i) users dislike giving feedback, and ii) there is a high risk of learning a poor query when a user judges only a small number of relevant documents. Relevance feedback could be expected to be effective in high-recall tasks such as legal or patent search, where users are willing to judge many documents in order to produce a query that has a high likelihood of finding many relevant documents.
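The answer does not name a specific algorithm; one classic formulation of the idea is Rocchio feedback in a vector-space setting. The sketch below is that Rocchio variant, not necessarily the method the exam assumes, and the alpha/beta/gamma weights are conventional illustrative values.

```python
from collections import Counter

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query toward judged-relevant documents
    and away from judged-non-relevant ones. Vectors are term -> weight dicts."""
    new_query = Counter({t: alpha * w for t, w in query_vec.items()})
    for doc in relevant_docs:
        for t, w in doc.items():
            new_query[t] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for t, w in doc.items():
            new_query[t] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are usually dropped.
    return {t: w for t, w in new_query.items() if w > 0}

# Example with tf vectors for two judged documents.
query = {"patent": 1.0, "infringement": 1.0}
rel = [{"patent": 3, "claim": 2, "infringement": 1}]
nonrel = [{"patent": 1, "recipe": 4}]
print(rocchio(query, rel, nonrel))
# The expanded query gains 'claim' and upweights 'patent',
# which is the "larger bag-of-words with term weights" described above.
```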

6 Heaps' Law

Write the formula for Heaps' Law (define your terms). Give one practical example of its use (i.e., a situation where it would be useful). [10 points]

Answer:

    V = K * N^β

where V is the vocabulary size (the number of distinct terms), N is the number of word occurrences in the corpus, K is a constant (usually 10 ≤ K ≤ 100), and β is a constant (usually 0.4 ≤ β ≤ 0.6 for English). Heaps' Law is useful for predicting the size of a term dictionary before building a system.
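A quick worked example of the prediction described above. The constants K = 30 and β = 0.5 are arbitrary values within the ranges given in the answer, used only to illustrate the calculation.

```python
# Heaps' Law: V = K * N^beta
K, beta = 30, 0.5                      # illustrative constants within the usual ranges
for N in (10**6, 10**8, 10**10):       # word occurrences in the corpus
    V = K * N ** beta
    print(f"N = {N:>14,}  predicted vocabulary size V = {V:>12,.0f}")
# With beta = 0.5, a 100x larger corpus grows the vocabulary by only 10x,
# which is why term dictionaries stay manageable even for very large corpora.
```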

7 Indexing

Describe Top docs (Champion) lists. What is their purpose, how are they constructed, and how many terms have a Top docs list (give an intuitive explanation, not a formula)? What is an advantage and a disadvantage of using Top docs lists? What kinds of query operators are a good match to Top docs lists, and what kinds are a poor match? [14 points]

Answer: A Top docs list is a type of inverted list that contains only the best documents that match a query term. Top docs lists are constructed by sorting the full inverted list on some metric, for example tf, p_smoothed(t|d), or PageRank(d), and then selecting just the top N documents. Typically a Top docs list is designed to fit into one or two operating system pages. Given Zipf's Law, typically only a small percentage of terms is frequent enough to have a Top docs list.

The advantage of Top docs lists is that they are much faster to process than full inverted lists. The disadvantage is that they don't contain all documents, so the document ranking may be less accurate than a ranking produced from the full inverted lists.

Good match: score operators (other than ranked Boolean #AND), because they can score a document even if some of their arguments do not match it. Poor match: Boolean #AND operators, because they only match documents in which all arguments occur. #NEAR and #WINDOW are also a poor match, because Top docs lists don't contain term positions.
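A minimal sketch of constructing a Top docs list from a full inverted list by sorting postings on tf and keeping the top N, as described above. The (docid, tf) posting format and the cutoff value are illustrative assumptions.

```python
def build_topdocs(inverted_list, n=100):
    """Keep only the n best postings for a term, ranked here by tf.
    A real system might rank by a smoothed probability or PageRank instead,
    and would choose n so the list fits in one or two OS pages."""
    return sorted(inverted_list, key=lambda posting: posting[1], reverse=True)[:n]

# Full inverted list for one term: (docid, tf) pairs.
full_list = [(12, 3), (47, 9), (90, 1), (105, 6), (233, 2)]
print(build_topdocs(full_list, n=3))   # [(47, 9), (105, 6), (12, 3)]
```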