Birkbeck (University of London)
MSc Examination for Internal Students
Department of Computer Science and Information Systems
Information Retrieval and Organisation (COIY64H7)
Credit Value: 5
Date of Examination: xxxday x xxx
Duration of Paper: xx:xx - xx:xx (xx hours)

RUBRIC
1. This paper contains xx questions for a total of xx marks.
2. Students should attempt to answer all of them.
3. This paper is not prior-disclosed.
4. The use of non-programmable electronic calculators is permitted.

1. (5 marks)
(a) Briefly explain the difference between tokenisation and normalisation. (4 marks)
(b) What is the most common algorithm used for stemming English? (1 mark)

2. (6 marks) Assume an IR system needs an inverted file index that maps k-grams to the terms that contain those k-grams, for tolerant retrieval. Start with an empty index and insert the terms the, their, theme, and they into this index. What will the final index look like?

3. (4 marks) ,, numbers are sorted in ascending order using a replacement selection sorting algorithm in the first step of external sorting. The main memory buffer can hold ,, numbers.
(a) How many blocks would be produced in the output if the numbers in the input are already sorted in ascending order?
(b) How many blocks would be produced in the output if the numbers in the input are sorted in descending order?

4. ( marks) A system supports wildcard queries with a permuterm index. What steps would have to be taken to process the following query: *A*B*? Assume that only prefix queries can be answered efficiently by the permuterm index.

5. (4 marks) During an evaluation of an IR system, a query returns the following answer set containing ten documents.

Rank  Document  Recall  Precision
 1    d5          %        %
 2    d           %       5%
 3    d7          %        %
 4    d           %       5%
 5    d8         4%       4%
 6    d9         4%        %
 7    d4         6%       4%
 8    d5         6%       8%
 9    d6         6%        %
10    d6         6%        %

Which of the documents in this list are relevant? How many relevant documents are not returned by the system?

6. (5 marks) An IR system uses a Gamma code to encode postings lists. The following bitstring is returned by an inverted file index as a gap-encoded list of DocIDs. Is this sequence properly encoded in Gamma code? If yes, write down the decoded list of DocIDs. If no, briefly explain why this bitstring is not a correct Gamma code.
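
As a companion to question 6, here is a minimal Python sketch of a Gamma-code decoder, assuming the textbook convention (a unary length prefix followed by the binary offset of the number with its leading 1 bit removed). The function names gamma_decode and gaps_to_docids and the example bitstring are illustrative assumptions, not material from the exam paper.

```python
def gamma_decode(bits):
    """Decode an Elias Gamma encoded bitstring into a list of gaps (sketch).

    Assumes the usual convention: each gap G >= 1 is encoded as a unary
    length prefix (floor(log2 G) ones followed by a zero) and then the
    binary offset of G with its leading 1 bit removed.
    """
    gaps, i, n = [], 0, len(bits)
    while i < n:
        length = 0
        while i < n and bits[i] == '1':   # unary prefix: count the 1s
            length += 1
            i += 1
        if i >= n:
            raise ValueError("truncated unary prefix: not a valid Gamma code")
        i += 1                             # skip the terminating '0'
        offset = bits[i:i + length]
        if len(offset) < length:
            raise ValueError("truncated offset: not a valid Gamma code")
        i += length
        gaps.append((1 << length) | int(offset, 2) if length else 1)
    return gaps


def gaps_to_docids(gaps):
    """Turn a gap-encoded postings list into absolute DocIDs (running sum)."""
    docids, current = [], 0
    for g in gaps:
        current += g
        docids.append(current)
    return docids


# Illustrative bitstring (not the one from the exam): gaps [9, 2] -> DocIDs [9, 11].
print(gaps_to_docids(gamma_decode("1110001100")))
```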

7. ( marks) There is a variant of the Levenshtein distance called the Damerau-Levenshtein distance, which allows one more edit operation (in addition to insert, delete, and replace): a transposition of two adjacent characters. For example, turning "liter" into "litre" requires one transposition, switching the positions of the letters e and r. Compared to the computation of the minimum cost of a cell in the Levenshtein distance algorithm, what modifications do you have to make to this calculation to implement the Damerau-Levenshtein distance?

8. (8 marks) Given the vectors for the query q and the two documents d1 and d2 as follows

q =
d1 =
d2 =

we see that the user is interested in documents containing the terms t, t4, and t6.
(a) Which of the above documents are in the answer set of the Boolean query t AND t4 AND t6?
(b) Which of the above documents are in the answer set of the Boolean query t OR t4 OR t6?
(c) Which of the above documents has a higher ranking if you apply the cosine measure? Do not apply any TF-IDF weighting, but do apply normalisation. (4 marks)

9. (10 marks) An IR system uses Rocchio's algorithm for relevance feedback. After a first iteration, the documents d1 and d2 are marked as relevant by the user, while the documents d3, d4, and d5 are marked as irrelevant. The vectors for the query q and the documents are given below:

q =
d1 =
d2 =
d3 =
d4 =
d5 =

As a reminder, the formula used in Rocchio's algorithm is:

q_m = α·q + (β/|D_r|) · Σ_{d_j ∈ D_r} d_j - (γ/|D_nr|) · Σ_{d_j ∈ D_nr} d_j

(a) Assume that the parameters used in this first iteration are α = , β = .5, and γ = .5. What is the modified query vector used for the second round? (6 marks)
(b) Does it make sense to adjust the parameters α, β, and γ between iterations? If yes, briefly explain how the parameters would be modified. If no, briefly explain your reasoning. (4 marks)
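
As a companion to question 9, the following is a minimal sketch of a single Rocchio update step, assuming NumPy term-weight vectors. The default parameter values, the example vectors, and the clipping of negative components to zero are illustrative choices, not the values prescribed by the question.

```python
import numpy as np


def rocchio_update(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio relevance feedback (sketch).

    query        : 1-D NumPy array, the original query vector
    relevant     : list of 1-D arrays, vectors of documents judged relevant (D_r)
    nonrelevant  : list of 1-D arrays, vectors of documents judged non-relevant (D_nr)
    alpha, beta, gamma : weights; the defaults are common textbook values,
                         not the parameters used in the exam question.
    """
    centroid_rel = np.mean(relevant, axis=0) if relevant else np.zeros_like(query, dtype=float)
    centroid_nonrel = np.mean(nonrelevant, axis=0) if nonrelevant else np.zeros_like(query, dtype=float)
    q_mod = alpha * query + beta * centroid_rel - gamma * centroid_nonrel
    # Components pushed below zero are usually clipped back to zero in practice.
    return np.maximum(q_mod, 0.0)


# Hypothetical usage with made-up 4-term vectors (not the vectors from the question):
q = np.array([1.0, 0.0, 1.0, 0.0])
dr = [np.array([1.0, 1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0, 0.0])]
dn = [np.array([0.0, 0.0, 0.0, 1.0])]
print(rocchio_update(q, dr, dn))
```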

10. (5 marks) Briefly explain how the language modelling approach to IR can capture the intuition that terms with high tf-idf scores in the vector space model should influence the ranking more than other terms.

11. ( marks) Suppose the document collection consists of the following two documents.

d1: We know which university is the best university.
d2: Georgia Tech is the best university in Georgia.

Suppose the query q is "best university". Show how the above documents should be ranked using MLE unigram language models estimated from the documents and from the collection, mixed with λ = 0.6 as the weight for the document models.

12. ( marks) Consider the following collection of documents that belong to two classes: amphibian (A) and non-amphibian (B).

          docid  doctext                      class
TRAINING  d1     red frogs eat red flies      A
          d2     green frogs eat flies        A
          d3     green lizards eat all flies  B
TEST      d4     red green green              ?

Show how the Naive Bayes algorithm (with Laplace smoothing) can be used to train a classifier and predict the class of the test document.

13. (5 marks) Compare the speed of two learning algorithms for text classification: Naive Bayes (NB) and k-nearest-neighbours (kNN). Which is usually faster with respect to training time? Why? Which is usually faster in classifying a new, previously unseen document? Why?

14. (5 marks) The following table shows the result of flat clustering on a document collection, where each letter A, B, or C represents a document in the true class A, B, or C respectively.

cluster 1   cluster 2   cluster 3
A A A B C C A B C C A B B C

What is the purity of the above clustering?

15. (5 marks) Given a set of five points {,,, 4, 5} on an axis, what clusters will 2-means clustering produce if 4 is chosen as the initial seed for one cluster and 5 is chosen as the initial seed for the other cluster? Use Euclidean distance; on an axis this is simply the absolute difference between two points.
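
As a companion to question 15, this sketch runs plain k-means over points that lie on a single axis, where Euclidean distance reduces to an absolute difference. The function name kmeans_1d, the iteration cap, and the points and seeds in the usage line are illustrative assumptions, not the exam's values.

```python
def kmeans_1d(points, seeds, iters=100):
    """Plain k-means for points on a line (sketch).

    points : iterable of numbers
    seeds  : initial centroids, one per cluster
    On a single axis, Euclidean distance is just the absolute difference.
    """
    centroids = list(seeds)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return clusters, centroids


# Hypothetical usage with made-up points and seeds (not the exam's values):
print(kmeans_1d([0, 2, 5, 9, 10], seeds=[2, 9]))
```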

16. (5 marks) Given a set of five points {,,, 6, 9} on an axis, perform single-link Hierarchical Agglomerative Clustering (HAC) and draw the generated dendrogram. Use Euclidean distance; on an axis this is simply the absolute difference between two points.
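
As a companion to question 16, here is a minimal sketch of single-link agglomerative clustering for points on an axis. It records the merge history (which clusters merged, and at what distance), which is the information needed to draw the dendrogram. The function name and the point set in the usage loop are illustrative assumptions, not necessarily the values intended by the question.

```python
def single_link_hac(points):
    """Single-link agglomerative clustering of points on a line (sketch).

    Returns the merge history: at each step the two clusters that were merged
    and the distance at which the merge happened. The distance between two
    clusters is the minimum pairwise distance between their members.
    """
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-link distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Hypothetical 1-D points (not necessarily the exam's set):
for left, right, dist in single_link_hac([1, 2, 3, 6, 9]):
    print(left, "+", right, "merged at distance", dist)
```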