CAIM, Cerca i Anàlisi d'Informació Massiva
Facultat d'Informàtica de Barcelona, Fall 2015

Exercises, Topic 2: Models of Information Retrieval

Basic comprehension questions. Check that you can answer them without looking at the course slides. If you can't, read the slides and then try again.

1. True or false: the boolean model does not rank the documents in the answer, while the vector model produces a ranking.

2. Can documents in the boolean model be regarded as vectors? Can AND, OR, and BUTNOT be described as vector operations?

3. Suppose you are given the frequency of every term in a given document. What other information do you need to compute its representation in tf-idf weights?

4. Hide the course slides. Write down the formula for the cosine measure of document similarity. Make sure you understand what every symbol means. Now look at the slides and check your answer. Repeat until correct.

5. Same as above for the tf-idf weight assignment scheme.

Exercise 1

Consider the following documents:

D1: Shipment of gold damaged in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck

and the following set of terms: T = {fire, gold, silver, truck}.

Compute, using the boolean model, which documents satisfy the query

(fire OR gold) AND (truck OR NOT silver)

and justify your answer. Do the same with the query

(fire OR NOT silver) AND (NOT truck OR NOT fire)

Argue whether it is possible to rewrite these queries, using only the operators AND, OR, and BUTNOT, in a logically equivalent way. The rewriting must be equivalent for all possible document collections, not just this one.

Exercise 2

Consider the following collection of five documents:

Doc1: we wish efficiency in the implementation for a particular application
Doc2: the classification methods are an application of Li's ideas
Doc3: the classification has not followed any implementation pattern
Doc4: we have to take care of the implementation time and implementation efficiency
Doc5: the efficiency is in terms of implementation methods and application methods

Assuming that every word with 6 or more letters is a term, and that terms are ordered by order of appearance:

1. Give the representation of each document in the boolean model.
2. Give the representation, in the vector model with tf-idf weights, of documents Doc1 and Doc5. Compute the similarity coefficient between these two documents using the cosine measure.
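In the boolean model a document is just the set of index terms it contains, so boolean queries like those in Exercise 1 reduce to membership tests. A minimal sketch for checking your answers by hand (the term sets are read off D1–D3, keeping only the terms in T):

```python
# Each document is represented by the subset of T = {fire, gold, silver, truck}
# that it contains.
docs = {
    "D1": {"fire", "gold"},     # Shipment of gold damaged in a fire
    "D2": {"silver", "truck"},  # Delivery of silver arrived in a silver truck
    "D3": {"gold", "truck"},    # Shipment of gold arrived in a truck
}

def query1(terms):
    # (fire OR gold) AND (truck OR NOT silver)
    return ("fire" in terms or "gold" in terms) and \
           ("truck" in terms or "silver" not in terms)

def query2(terms):
    # (fire OR NOT silver) AND (NOT truck OR NOT fire)
    return ("fire" in terms or "silver" not in terms) and \
           ("truck" not in terms or "fire" not in terms)

print("query 1:", {name for name, terms in docs.items() if query1(terms)})
print("query 2:", {name for name, terms in docs.items() if query2(terms)})
```

The same set-based representation answers Exercise 2, part 1: list, for each document, which of the ordered terms it contains.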

Exercise 3

We have indexed a collection of documents containing the terms in the following table; the second column indicates the percentage of documents in which each term appears.

Term          % docs
computer      10%
software      10%
bugs          5%
code          2%
developer     2%
programmers   2%

Given the query Q = computer software programmers, compute the similarity between Q and the following documents, using tf-idf weights for the documents, binary weights for the query, and the cosine measure. Determine their relative ranking:

D1 = programmers write computer software code
D2 = most software has bugs, but good software has less bugs than bad software
D3 = some bugs can be found only by executing the software, not by examining the source code

(Answer: I get the similarities to be 0.966, 0.436, 0.244.)

Exercise 4

Suppose that terms A, B, C, and D appear, respectively, in 10,000, 8,000, 5,000, and 3,000 documents of a collection of 100,000.

1. Consider the boolean query (A AND B) OR (C AND D). How large can the answer to this query be, in the worst case?
2. And for the query (A AND B) OR (A AND D)? Think carefully.
3. Compute the similarity of the documents "A B B A C C" and "D A D B B C C" using tf-idf weighting and the cosine measure.

(Answers: 1) 11,000; 2) 10,000; 3) 0.736.)
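The computations in Exercise 3 can be scripted to check a hand calculation. The sketch below assumes raw term counts for tf and idf = log10(N/df); the exact values depend on the logarithm base and tf convention used in class, so treat it as a checking aid, not as the official recipe:

```python
import math
from collections import Counter

# Document frequencies as fractions of the collection (Exercise 3's table).
df = {"computer": 0.10, "software": 0.10, "bugs": 0.05,
      "code": 0.02, "developer": 0.02, "programmers": 0.02}

def idf(term):
    return math.log10(1.0 / df[term])          # idf = log(N / df)

def tfidf(text):
    # Raw term counts for tf; words not in the index are ignored.
    counts = Counter(w.strip(",.") for w in text.lower().split())
    return {t: counts[t] * idf(t) for t in df if counts[t]}

def sim(query_terms, doc):
    # Binary weights (0/1) for the query, tf-idf weights for the document.
    dot = sum(doc.get(t, 0.0) for t in query_terms)
    qnorm = math.sqrt(len(query_terms))
    dnorm = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (qnorm * dnorm) if dnorm else 0.0

q = {"computer", "software", "programmers"}
d2 = tfidf("most software has bugs, but good software has less bugs than bad software")
print(round(sim(q, d2), 3))   # 0.436 with these conventions, matching the answer for D2
```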

Exercise 5

We have an indexed collection of one million documents that includes the following terms:

Term          # docs
computing     300,000
networks      200,000
computer      100,000
files         100,000
system        100,000
client        80,000
programs      80,000
transfer      50,000
agents        40,000
p2p           20,000
applications  10,000

1. Compute the similarity between the following documents D1 and D2, using tf-idf weights and the cosine measure:

D1 = "p2p programs help users sharing files, applications, other programs, etc. in computer networks"
D2 = "p2p networks contain programs, applications, and also files"

2. Assume we are using the cosine measure and tf-idf weights to compute document similarity. Give a document containing exactly two different terms that achieves maximum similarity with the following document:

"p2p networks contain programs, applications, and also files"

Compute this similarity and justify that it is indeed maximum among documents with two terms.

(Answer to 1: 0.925.)

Exercise 6

Consider the following document:

Peer-to-peer (P2P) computing is the sharing of computer resources and services by direct collaboration between client systems. These resources and services often include the exchange

of information (Napster, Freenet, etc.), processing cycles (distributed.net, SETI@home, etc.), and disk storage for files (OceanStore, Farsite, etc.). Peer-to-peer computing takes advantage of existing desktop computing power and networking connectivity, allowing off-the-shelf clients to leverage their collective power beyond the sum of their parts.

Current research on P2P has evolved from very different research areas. Among others, P2P has attracted the attention of researchers working on classical distributed computing, mobile agents, parallel computing, or communications. Very interestingly, the P2P paradigm is different from those studied in all these areas. For instance, while in some sense peer-to-peer computing is very similar to classical distributed computing (as opposed to the client-server paradigm), some new characteristics emerge. These include the clear and present danger of malicious peers and a high churn rate (peers joining and leaving the system), among others.

This text is part of a collection of one million documents that we have indexed. Assume that all documents and queries go through a reasonably effective stemmer, and that only the following terms have been included in the index. The index additionally stores the number of documents in which each term appears.

Term     # documents
comput   300,901
network  200,019
system   110,990
client   80,921
agent    42,003
traffic  40,105
p2p      20,909
peer     10,979

1. Write the document's representation in the boolean model.
2. Write the document's representation in the vector model with tf-idf weighting.
3. Compute the similarity of the document with the query "p2p computer systems", using binary weights for the query and the cosine measure. Repeat the same with the query "client network traffic".
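Exercise 5, part 1 and Exercise 6 both involve similarities over a larger index where idf comes from a document-frequency table. The sketch below works through Exercise 5, part 1, a document-document similarity where both vectors carry tf-idf weights; the same assumptions as before apply (raw tf, idf = log10(N/df)), so your class conventions may give slightly different numbers:

```python
import math
from collections import Counter

N = 1_000_000  # collection size from Exercise 5
docfreq = {"computing": 300_000, "networks": 200_000, "computer": 100_000,
           "files": 100_000, "system": 100_000, "client": 80_000,
           "programs": 80_000, "transfer": 50_000, "agents": 40_000,
           "p2p": 20_000, "applications": 10_000}

def tfidf(text):
    # Raw counts of indexed terms, weighted by idf = log10(N / df).
    counts = Counter(w.strip(",.") for w in text.lower().split())
    return {t: counts[t] * math.log10(N / n) for t, n in docfreq.items() if counts[t]}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    return dot / (math.sqrt(sum(w * w for w in a.values())) *
                  math.sqrt(sum(w * w for w in b.values())))

d1 = tfidf("p2p programs help users sharing files, applications, "
           "other programs, etc. in computer networks")
d2 = tfidf("p2p networks contain programs, applications, and also files")
print(round(cosine(d1, d2), 3))   # 0.925 with these conventions, matching the given answer
```

For Exercise 6, replace the table with the stemmed index (comput, network, ...) and count stem occurrences in the text instead of exact words.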

Exercise 7

Consider the following collection of four documents:

Doc1: Shared Computer Resources
Doc2: Computer Services
Doc3: Digital Shared Components
Doc4: Computer Resources Shared Components

Assuming each word is a term:

1. Write the boolean model representation of document Doc3.
2. Which documents are retrieved, in the boolean model, by the query Computer BUTNOT Components?
3. Compute the idf value of the terms Computer and Components.
4. Compute the vector model representation, using tf-idf weights, of the document collection.
5. Compute the similarity between the query Computer Components and Doc4, using binary weights for the query and cosine similarity.