Document Representation: Quiz


Q1. In-memory index construction faces the following problems:
(A) Scaling problem
(B) The optimal use of hardware resources for scaling
(C) Easily keeping the entire data in main memory
(D) Use of merge sort

Q2. Choose the correct statement(s) for "sort-based index construction":
(A) Uses quicksort
(B) Is a three-step process: (1) parsing and index construction, (2) sorting, and (3) merging
(C) Suffers from a scaling problem
(D) Uses external sorting techniques

Q3. Choose the correct statement(s):
(A) In-memory sorting is not applicable for index construction, as we cannot keep the whole data in memory at one time
(B) Disk seeks for in-memory indexing will be very time-consuming
(C) External sorting with sequential disk access can be beneficial for index construction
(D) Merge sort has a bottleneck (i.e., random disk seeks are slower than sequential disk access)

Q4. Select the appropriate worst-case time complexity for quicksort and merge sort, respectively, for a given number N of elements:
(A) O(N), O(N^2)
(B) O(N^2), O(N log N)
(C) O(N log N), O(N^2)
(D) O(N^2), O(N^2)

Q5. Identify the correct statements regarding BSBI and SPIMI:
(A) Blocked sort-based indexing has excellent scaling properties, but it needs a data structure for mapping terms to termIDs.
(B) SPIMI uses terms instead of termIDs
(C) A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list, instead of first collecting all termID-docID pairs and then sorting them
(D) All of the above

Q6. Select the true statements regarding SPIMI:
(A) Unlike BSBI, it generates a separate dictionary for each block.
(B) It does not maintain a term-termID mapping across blocks (which requires huge in-memory operations).
(C) It does not apply termID-docID-based sorting; it accumulates postings in postings lists as they occur.
(D) It uses the dictionary to generate a complete inverted index for each block.

Q7. Select the correct statements regarding distributed indexing:
(A) For web-scale indexing, we must use a distributed computing cluster
(B) A master machine directs the indexing job
(C) The master machine assigns each task to an idle machine from a pool
(D) Only (B) and (C) are correct

Q8. Select the correct statements regarding MapReduce:
(A) MapReduce is a distributed programming tool designed for indexing and analysis tasks
(B) It uses a master machine, which breaks the indexing job into sets of (parallel) tasks and assigns each task to an idle machine (node) from a pool.
(C) The reduce step of MapReduce reduces the set of indexes by deleting less frequent indexes.
(D) The reduce step uses some reduction function to reduce the set of indexes by deleting less frequent indexes.

Q9. Select the correct map output produced by the map function of MapReduce for the input [D1: ID came, ID c'ed], [D2: ID died]:
(A) <ID,D1>, <came,D1>, <ID,D1>, <c'ed,D1>, <ID,D2>, <died,D2>
(B) <ID,D1>, <came,D1>, <c'ed,D1>, <ID,D2>, <died,D2>
(C) <ID,D1>, <came,D1>, <c'ed,D1>, <died,D2>
(D) None of the above

Q10. This question is related to Q9 (see above). For the given input [D1: ID came, ID c'ed], [D2: ID died], select the correct final reduced index:
(A) (<ID,(D1:2,D2:1)>, <died,(D2:1)>, <came,(D1:1)>, <c'ed,(D1:1)>)
(B) (<ID,(D1,D2,D1)>, <died,(D2)>, <came,(D1)>, <c'ed,(D1)>)
(C) <ID,D1>, <came,D1>, <ID,D1>, <c'ed,D1>, <ID,D2>, <died,D2>
(D) None of the above

Q11. Select the correct statements related to dynamic indexing:
(A) It is useful when the frequency of deletion or modification of web pages is very high
(B) It requires frequent modification of (1) the postings lists and (2) the dictionary for each addition and modification of web pages
(C) It is not useful for web pages containing images and videos.
(D) It is useful for web-page indexing or dynamically changing digital libraries

Q12. Select the correct statements related to "dynamic indexing at search engines":
(A) All the large search engines now do dynamic indexing.
(B) News items, blogs, and new topical web pages change frequently.
(C) To manage the changes and updates, they periodically reconstruct the index from scratch.
(D) But they do not make any change in query processing, which remains based on the old index.

Q13. Dynamic indexing uses "main and auxiliary indexes". Select the correct related statement(s):
(A) Frequent merges are very easy and unproblematic.
(B) Frequent merges are problematic.
(C) Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
(D) Instead of using one big file to store the index, use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

Q14. Select the correct statements regarding the Binary Independence Model:
(A) It assumes that documents are binary vectors, i.e., only the presence or absence of terms in documents is recorded.
(B) Terms are independently distributed in the set of relevant documents and in the set of irrelevant documents.
(C) The representation is an ordered set of Boolean variables.
(D) Independence signifies that terms in the document are considered independently of each other and no association between them is modeled.

Q15. The major differences/similarities between BM25 and BM25F are:
(A) BM25 uses a bag-of-words approach but BM25F does not
(B) Both use a bag-of-words approach
(C) Both use a bigram-based model
(D) BM25F gives different importance to terms appearing in (1) the body, (2) the title, and (3) the anchor text, but BM25 treats all words equally.

Q16. Identify the correct statements related to "Okapi BM25":
(A) Okapi BM25 is a document searching function used by search engines to just search for matching documents according to their relevance to a given search query.
(B) Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query.
(C) It is based on the probabilistic retrieval framework.
(D) It is entirely based on HMMs.

Q17. The major demerits of "Okapi BM25" are:
(A) BM25 is a bigram-based retrieval function that ranks a set of documents based on the query terms appearing in each document.
(B) BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.
(C) It does not consider the inter-relationship between the query terms within a document (e.g., their relative proximity).
(D) It always considers the inter-relationship between the query terms within a document (e.g., their relative proximity).

Q18. Identify the correct statements according to the "Okapi BM25" equation given below:
(A) f(qi, D) is qi's term frequency in document D
(B) |D| is the length of document D in words
(C) |D| is the length of all documents in words
(D) avgdl is the average document length in the text collection from which the documents are drawn

Q19. For the IDF part of the BM25 equation (see Q18 for the complete equation), identify the correct statements:
(A) N is the total number of documents in the collection
(B) n(qi) is the number of documents containing qi
(C) In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
(D) None of the above

Q20. Identify the correct statements related to the Binary Independence Model:
(A) It is a probabilistic information retrieval technique that makes some simple assumptions to make the estimation of document/query similarity probability feasible.
(B) It uses the relations between the words of the given documents.
(C) The independence assumption allows the representation to be treated as an instance of a vector space model, by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
(D) The Binary Independence Assumption is that documents are binary vectors
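
For Q18 and Q19, the commonly stated form of the Okapi BM25 scoring function can be sketched as follows, using the symbols from the answer options. The parameters k_1 and b are free tuning constants that the quiz does not name, so they are an assumption here, and some presentations add 1 inside the logarithm of the IDF term to keep it non-negative.

```latex
\[
\operatorname{score}(D, Q) = \sum_{i=1}^{n} \operatorname{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left( 1 - b + b\,\frac{|D|}{\mathrm{avgdl}} \right)}
\]
\[
\operatorname{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}
\]
```

Here Q is the query with terms q_1, ..., q_n, f(q_i, D) is the frequency of q_i in document D, |D| is the length of D in words, avgdl is the average document length in the collection, N is the number of documents, and n(q_i) is the number of documents containing q_i.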

KEYS: 1-A, B; 2-B, D; 3-A, B, C, D; 4-B; 5-D; 6-A, B, C, D; 7-A, B, C; 8-A, B; 9-A; 10-A; 11-A, B, D; 12-A, B, C; 13-B, C, D; 14-A, B, C, D; 15-B, D; 16-B, C; 17-B, C; 18-A, B, D; 19-A, B, C; 20-A, C, D
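
The map and reduce steps asked about in Q9 and Q10 can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not a distributed implementation: the function names map_phase and reduce_phase are hypothetical, tokenisation is assumed to be done already, and the abbreviated token in D1 is written c'ed.

```python
from collections import Counter, defaultdict

# Toy collection from Q9/Q10; the token lists are assumed to be pre-tokenised.
docs = {
    "D1": ["ID", "came", "ID", "c'ed"],
    "D2": ["ID", "died"],
}


def map_phase(doc_id, tokens):
    """Emit one (term, docID) pair per token occurrence, duplicates included."""
    return [(term, doc_id) for term in tokens]


def reduce_phase(pairs):
    """Group the pairs by term and count the occurrences per document."""
    index = defaultdict(Counter)
    for term, doc_id in pairs:
        index[term][doc_id] += 1
    return {term: dict(counts) for term, counts in index.items()}


# Run the map step over every document, then reduce all emitted pairs.
pairs = [pair for doc_id, tokens in docs.items()
         for pair in map_phase(doc_id, tokens)]
index = reduce_phase(pairs)

print(pairs)
print(index)
```

The map step keeps both occurrences of ID in D1, and the reduce step collects all pairs for a term into one postings list with per-document counts; nothing is deleted for being infrequent.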