Document Representation : Quiz

Q1. In-memory index construction faces the following problems:
(A) Scaling problem
(B) The optimal use of hardware resources for scaling
(C) Easily keep entire data in main memory
(D) Use merge sort

Q2. Choose the correct statement(s) for "Sort-based index construction":
(A) Uses quicksort
(B) Is a three-step process: (1) parsing and index construction, (2) sorting, and (3) merging
(C) Suffers from a scaling problem
(D) Uses external sorting techniques

Q3. Choose the correct statement(s):

(A) In-memory sorting is not applicable for index construction, as we cannot keep the whole data in memory at one time
(B) Disk seeks for in-memory indexing will be very time consuming
(C) External sorting with sequential disk access can be beneficial for index construction
(D) Merge sort has a bottleneck (i.e., a random disk seek is slower than a sequential disk seek)

Q4. Select the appropriate worst-case time complexities for quicksort and merge sort, respectively, for N elements:
(A) O(N), O(N^2)
(B) O(N^2), O(N log N)
(C) O(N log N), O(N^2)
(D) O(N^2), O(N^2)

Q5. Identify the correct statements regarding BSBI and SPIMI:
(A) Blocked sort-based indexing (BSBI) has excellent scaling properties, but it needs a data structure for mapping terms to termIDs.

(B) SPIMI uses terms instead of termIDs
(C) A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list, instead of first collecting all termID-docID pairs and then sorting them
(D) All of the above

Q6. Select the true statements regarding SPIMI:
(A) Unlike BSBI, it generates a separate dictionary for each block.
(B) It does not maintain a term-termID mapping across blocks (which requires huge in-memory operations).
(C) It does not apply termID-docID based sorting; it accumulates postings in postings lists as they occur.
(D) It uses a dictionary to generate a complete inverted index for each block.

Q7. Select the correct statements regarding distributed indexing:
(A) For web-scale indexing, we must use a distributed computing cluster

(B) Maintain a master machine directing the indexing job
(C) The master machine assigns each task to an idle machine from a pool
(D) Only (B) and (C) are correct

Q8. Select the correct statements regarding MapReduce:
(A) MapReduce is a distributed programming tool designed for indexing and analysis tasks
(B) It uses a master machine, which breaks the indexing job into sets of (parallel) tasks and assigns each task to an idle machine (node) from a pool.
(C) The reduce step of MapReduce reduces the set of indexes by deleting less frequent indexes.
(D) The reduce step uses some reduction function to reduce the set of indexes by deleting less frequent indexes.

Q9. Select the correct map output produced by the map function of MapReduce for [D1: ID came, ID c ed; D2: ID died]:
(A) <ID,D1>, <came,D1>, <ID,D1>, <c ed,D1>, <ID,D2>, <died,D2>

(B) <ID,D1>, <came,D1>, <c ed,D1>, <ID,D2>, <died,D2>
(C) <ID,D1>, <came,D1>, <c ed,D1>, <died,D2>
(D) None of the above

Q10. This question is related to Q9 (see above). For the given documents [D1: ID came, ID c ed; D2: ID died], select the correct final reduced index:
(A) (<ID,(D1:2,D2:1)>, <died,(D2:1)>, <came,(D1:1)>, <c ed,(D1:1)>)
(B) (<ID,(D1,D2,D1)>, <died,(D2)>, <came,(D1)>, <c ed,(D1)>)
(C) <ID,D1>, <came,D1>, <ID,D1>, <c ed,D1>, <ID,D2>, <died,D2>
(D) None of the above

Q11. Select the correct statements related to dynamic indexing:
(A) Useful where the frequency of deletion or modification of web pages is very high
(B) It requires frequent modification of (1) postings lists and (2) the dictionary, for each addition and modification of web pages

(C) Not useful for web pages containing images and videos.
(D) Useful for web page indexing or dynamically changing digital libraries

Q12. Select the correct statements related to "Dynamic indexing at search engines":
(A) All the large search engines now do dynamic indexing.
(B) News items, blogs, and new topical web pages show frequent changes.
(C) To manage the changes and updates, they periodically reconstruct the index from scratch.
(D) However, they do not make any change in query processing based on the old index.

Q13. Dynamic indexing uses "main and auxiliary indexes". Select the correct related statement(s):
(A) Frequent merges are very easy and unproblematic.
(B) Frequent merges are a problem.
(C) Merging of the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
(D) Instead of using one big file to store indexes, use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)

Q14. Select the correct statements regarding the Binary Independence Model:
(A) It assumes that documents are binary vectors, i.e., only the presence or absence of terms in documents is recorded.
(B) Terms are independently distributed in the set of relevant documents and in the set of irrelevant documents.
(C) The representation is an ordered set of Boolean variables.
(D) Independence signifies that terms in the document are considered independently of each other and no association between them is modeled.

Q15. The major differences/similarities between BM25 and BM25F are:
(A) BM25 uses a bag-of-words approach but BM25F does not
(B) Both use a bag-of-words approach

(C) Both use a bigram-based model
(D) BM25F gives different importance to terms appearing in (1) body, (2) title, and (3) anchor text, but BM25 treats all words equally.

Q16. Identify the correct statements related to "Okapi BM25":
(A) Okapi BM25 is a document searching function used by search engines only to find matching documents according to their relevance to a given search query.
(B) Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query.
(C) It is based on the probabilistic retrieval framework.
(D) It is based entirely on HMMs.

Q17. The major demerits of "Okapi BM25" are:
(A) BM25 is a bigram-based retrieval function that ranks a set of documents based on the query terms appearing in each document.
(B) BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document.

(C) It does not consider the inter-relationship between the query terms within a document (e.g., their relative proximity).
(D) It always considers the inter-relationship between the query terms within a document (e.g., their relative proximity).

Q18. Identify the correct statements according to the "Okapi BM25" equation given below:
score(D, Q) = sum over query terms qi of IDF(qi) * f(qi, D) * (k1 + 1) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))
(A) f(qi, D) is the term frequency of qi in document D
(B) |D| is the length of document D in words
(C) |D| is the length of all documents in words
(D) avgdl is the average document length of the text collection from which the documents are drawn

Q19. For the IDF part of the BM25 equation (see Q18 for the complete equation), identify the correct statements:

(A) N is the total number of documents in the collection
(B) n(qi) is the number of documents containing qi
(C) In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.
(D) None of the above

Q20. Identify the correct statements related to the Binary Independence Model:
(A) It is a probabilistic information retrieval technique that makes some simple assumptions to make the estimation of document/query similarity probability feasible.
(B) It uses the relation between words of the given documents.
(C) This assumption allows the representation to be treated as an instance of a vector space model by considering each term as a value of 0 or 1 along a dimension orthogonal to the dimensions used for the other terms.
(D) The Binary Independence Assumption is that documents are binary vectors
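
Reference sketch for Q18 and Q19: the symbols f(qi, D), |D|, avgdl, N, and n(qi) can be seen working together in a small Python scorer. This is a minimal illustration, not a production ranker. The function name bm25_scores, the toy documents, the defaults k1 = 1.5 and b = 0.75, and the smoothed IDF form ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1) are assumptions chosen for the example and do not come from the quiz.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with BM25.

    k1 and b are free parameters; the defaults here are assumed, commonly
    quoted values, not values taken from the quiz.
    """
    N = len(docs)                              # total number of documents
    avgdl = sum(len(d) for d in docs) / N      # average document length in words
    # n(qi): number of documents that contain query term qi
    n = {q: sum(1 for d in docs if q in d) for q in set(query)}

    scores = []
    for d in docs:
        tf = Counter(d)                        # f(qi, D): term frequencies in D
        score = 0.0
        for q in query:
            # Smoothed IDF variant (an assumption of this sketch).
            idf = math.log((N - n[q] + 0.5) / (n[q] + 0.5) + 1.0)
            f = tf[q]
            # len(d) plays the role of |D|, the length of document D in words.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


# Hypothetical toy collection, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "birds sing at dawn".split(),
]
scores = bm25_scores(["cat", "mat"], docs)
```

The first document matches both query terms and should rank highest; the third matches neither and scores zero. Because each query term contributes independently, the score ignores how close the terms are to each other in the document, which is the bag-of-words limitation that Q17 asks about.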

KEYS: 1-A,B; 2-B,D; 3-A,B,C,D; 4-B; 5-D; 6-A,B,C,D; 7-A,B,C; 8-A,B; 9-A; 10-A; 11-A,B,D; 12-A,B,C; 13-A,C,D; 14-A,B,C,D; 15-B,D; 16-B,C; 17-B,C; 18-A,B,D; 19-A,B,C; 20-A,C,D
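
Worked sketch for Q9 and Q10: the map and reduce steps on the two example documents can be traced in a few lines of single-process Python. This only illustrates the data flow and is not a distributed implementation. The function names map_phase and reduce_phase are hypothetical, and the tokenization (treating "c ed" as one token, exactly as it appears in the answer options) is an assumption.

```python
from collections import defaultdict


def map_phase(doc_id, tokens):
    """Emit one (term, doc_id) pair per token occurrence, duplicates included."""
    return [(term, doc_id) for term in tokens]


def reduce_phase(pairs):
    """Group pairs by term and count occurrences per document."""
    index = defaultdict(dict)
    for term, doc_id in pairs:
        index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Assumed tokenization of the quiz documents.
docs = {
    "D1": ["ID", "came", "ID", "c ed"],
    "D2": ["ID", "died"],
}

# Map step: every document is processed independently.
pairs = [pair for doc_id, tokens in docs.items()
         for pair in map_phase(doc_id, tokens)]

# Reduce step: all pairs for the same term are merged into one postings list.
index = reduce_phase(pairs)
```

Here pairs keeps the repeated <ID,D1> entry, matching option (A) of Q9, and index["ID"] becomes {"D1": 2, "D2": 1}, matching option (A) of Q10. Nothing is deleted in the reduce step; postings are only grouped and counted.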