CS646 (Fall 2016) Homework 1

Deadline: 11:59pm, Sep 28th, 2016 (EST)

Access the following resources before you start working on HW1:

- Download the corpus file on Moodle: acm corpus.gz (about 90 MB).
- Check out the starter code on GitHub: cs646_hw1. You can directly import the Maven project in IntelliJ.

1 Working with Corpus and Index (50 points)

The XML-like corpus file acm corpus.gz contains information on about 270,000 documents. After decompression, you can read and process it as a plain text file (using the UTF-8 character encoding). Each document contains the metadata and abstract of an academic article in the following format:

    <DOC>
    <DOCNO>ACM </DOCNO>
    <TEXT>
    Relevance based language models Victor Lavrenko, W. Bruce Croft Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval Abstract: We explore the relation between classical probabilistic models of information retrieval and the emerging language modeling approaches. It has long been recognized that the primary obstacle to effective performance of classical models is the need to estimate a relevance model: probabilities of words in the relevant class. We propose a novel technique for estimating these probabilities using the query alone. We demonstrate that our technique can produce highly accurate relevance models, addressing important notions of synonymy and polysemy. Our experiments show relevance models outperforming baseline language modeling systems on TREC retrieval and TDT tracking tasks. The main contribution of this work is an effective formal method for estimating a relevance model with no training data.
    </TEXT>
    </DOC>

The <DOCNO> field is a unique ID (docno) of the document. For example, the docno of the above document is ACM

The <TEXT> field contains the title, the authors, the publication venue, and the abstract of the article (without any structure).

This format is usually called the trectext format. Note that it is only an XML-like format: the corpus file is not a valid XML document, so you may not be able to parse it using regular XML APIs. However, a trectext file is very easy to parse from scratch.

1.1 Processing the Corpus (20 points)

Write a program that iterates over the corpus file and reads each document together with its <DOCNO> and <TEXT> fields. You are free to use any programming language for this part. Tokenize the text of the <TEXT> field by splitting on any character that is neither alphabetic (a-z and A-Z) nor a digit (0-9), and normalize the resulting tokens to lowercase. You do not need to stem the words or remove stop words in HW1. For example, your program should tokenize "Victor Lavrenko, W. Bruce Croft" into the following five tokens:

    victor lavrenko w bruce croft

Count and report the following statistics of the corpus:

- The total number of documents in the corpus (2 points).

- The average length of the <TEXT> field for a document (3 points). The length of a document field refers to the total number of words in that field.
- The number of unique words that appear in the whole corpus (excluding the <DOCNO> field) (3 points).
- Find the longest document(s), measured by the total number of words (excluding the <DOCNO> field). Report the docno(s) of the longest document(s) and the length (3 points).
- The document frequency (the number of documents containing the word) of "information" and "retrieval" (2 points). Compute and report the IDF of the two words as well (2 points).

  $$\mathrm{IDF}(w) = \log \frac{N}{n_w}$$

  N: the total number of documents in the corpus.
  n_w: the number of documents containing the word w.
- Find all documents containing both "query" and "reformulation". Report the number of documents containing both words (2 points) and report the docnos of these documents (2 points).
- How long does it take for your program to count all these statistics? (1 point)

Include these statistics in your report. Submit your source code for counting these statistics to Moodle along with your report.

1.2 Building an Index (10 points)

Read the Lucene and Galago tutorials on GitHub. Build an index for the corpus acm corpus.gz using either Lucene or Galago (whichever you prefer). Do NOT apply any stemming or remove stop words when you build the index. Read the requirements of 1.3 in advance and make sure the index you create provides an efficient way to compute the statistics in 1.3.

If you use Lucene, you can use LuceneBuildIndex.java as the starter code. Submit your index-building program along with your report. If you use Galago, report your command for building the index (including all the parameters) in the report. You do NOT need to submit the index folder to Moodle.

1.3 Accessing Index (20 points)

Write a program to compute the following statistics based on the index you built in 1.2. Note that it is okay if the statistics based on the index differ slightly from those you counted on your own in 1.1 (because the tokenization process is different).

- The total number of documents in the corpus (2 points).
- The average length of the <TEXT> field for a document (3 points).
- The number of unique words that appear in the whole corpus (excluding the <DOCNO> field) (3 points).
- Find the longest document(s) (excluding the <DOCNO> field). Report the docno(s) of the longest document(s) and the length (3 points).
- The document frequency of "information" and "retrieval" (2 points). Compute and report the IDF of the two words as well (2 points).
- Retrieve the posting lists for "query" and "reformulation". Merge them using a Boolean AND operation. Report the number of documents in the merged list (2 points) and the docnos of each document (4 points).
- How long does it take for your program to count all these statistics based on the index? (1 point)

Include these statistics in your report. Submit your source code for counting these statistics to Moodle along with your report.
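The tokenization rule from 1.1, the IDF definition, and the Boolean AND merge from 1.3 can be sketched on a toy example. This is a minimal illustration in Python, not the Java starter code: the in-memory dictionary stands in for a Lucene or Galago index, the names `tokenize`, `build_postings`, `idf`, `intersect`, and `TOY_DOCS` are hypothetical, and the natural logarithm is an assumption because the assignment does not state the base of the log.

```python
import math
import re
from collections import defaultdict

# Any run of characters that are neither a-z, A-Z nor 0-9 separates tokens.
TOKEN_SPLIT = re.compile(r"[^a-zA-Z0-9]+")


def tokenize(text):
    """Split on non-alphanumeric characters and lowercase the tokens."""
    return [tok.lower() for tok in TOKEN_SPLIT.split(text) if tok]


def build_postings(docs):
    """Map each word to the sorted list of docnos that contain it."""
    postings = defaultdict(set)
    for docno, text in docs.items():
        for tok in tokenize(text):
            postings[tok].add(docno)
    return {word: sorted(docnos) for word, docnos in postings.items()}


def idf(word, postings, num_docs):
    """IDF(w) = log(N / n_w); the natural log is an assumption."""
    return math.log(num_docs / len(postings[word]))


def intersect(list_a, list_b):
    """Boolean AND of two sorted posting lists using a two-pointer merge."""
    i, j, merged = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            merged.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return merged


# Hypothetical three-document corpus (not taken from acm corpus.gz).
TOY_DOCS = {
    "D1": "Query reformulation in web search",
    "D2": "Information retrieval models",
    "D3": "Reformulation of a user query; query logs",
}
POSTINGS = build_postings(TOY_DOCS)
BOTH = intersect(POSTINGS["query"], POSTINGS["reformulation"])
print(BOTH)
print(idf("query", POSTINGS, len(TOY_DOCS)))
```

The two-pointer merge relies on both posting lists being sorted by docno, which is why `build_postings` sorts them. With a real index, the posting lists would come from the index API instead of this dictionary.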

2 Ranking Results (25 points)

Use GalagoSearchIndex.java or LuceneSearchIndex.java as the starter code to implement the following three retrieval models. Your implementation should retrieve the posting lists of the query terms from the index, and then merge and sort the posting lists into a ranked list of search results. In HW1, you do NOT need to consider how to merge the posting lists efficiently. However, you should at least compute your final results from posting lists rather than from a brute-force scan of all the indexed documents. Your implementation should be robust enough to handle queries of arbitrary length. Do not change the output part of the starter code.

- Boolean AND (5 points): find the results containing all query terms. As we mentioned in class, Boolean search returns a set of results without defining how to rank them. However, in your program, please rank the retrieved results by their docnos in ascending order (this helps us with grading).

- TF-IDF (10 points):

  $$\mathrm{score}(q, d) = \sum_{w \in q} \mathrm{freq}(w, d) \cdot \log \frac{N}{n_w}$$

  freq(w, d) is the frequency of the word w in the document d. n_w is the document frequency of w in the whole corpus. N is the total number of documents in the corpus.

- VSM (cosine similarity) (10 points):

  $$\mathrm{score}(q, d) = \frac{\sum_{w \in q} \mathrm{freq}(w, q) \cdot \mathrm{freq}(w, d)}{\sqrt{\sum_{w \in q} \mathrm{freq}(w, q)^2} \cdot \sqrt{\sum_{w \in d} \mathrm{freq}(w, d)^2}}$$

  freq(w, q) and freq(w, d) are the frequencies of the word w in the query q and the document d, respectively. We also use q and d for the sets of unique words in the query and the document, respectively.
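The two scoring formulas can be checked on a toy index before wiring them into the starter code. The sketch below is in Python rather than Java and makes several assumptions: the dictionary `postings` (word to docno to frequency) stands in for posting lists read from Lucene or Galago, document norms are precomputed at index time, q is treated as a set of unique words in the TF-IDF sum, the log is natural, and all names (`build_index`, `tfidf_scores`, `cosine_scores`, `rank`, `TOY_DOCS`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """docs maps docno -> token list.

    Returns postings (word -> {docno: freq(w, d)}) and
    doc_norms (docno -> sqrt of the sum of squared term frequencies).
    """
    postings = defaultdict(dict)
    doc_norms = {}
    for docno, tokens in docs.items():
        counts = Counter(tokens)
        for word, freq in counts.items():
            postings[word][docno] = freq
        doc_norms[docno] = math.sqrt(sum(f * f for f in counts.values()))
    return postings, doc_norms


def tfidf_scores(query_tokens, postings, num_docs):
    """Sum freq(w, d) * log(N / n_w) over the unique query words."""
    scores = defaultdict(float)
    for word in set(query_tokens):
        plist = postings.get(word, {})
        if not plist:
            continue  # a word that occurs in no document adds nothing
        idf = math.log(num_docs / len(plist))
        for docno, freq in plist.items():
            scores[docno] += freq * idf
    return scores


def cosine_scores(query_tokens, postings, doc_norms):
    """Cosine similarity between raw term-frequency vectors."""
    q_counts = Counter(query_tokens)
    q_norm = math.sqrt(sum(f * f for f in q_counts.values()))
    dots = defaultdict(float)
    for word, q_freq in q_counts.items():
        for docno, d_freq in postings.get(word, {}).items():
            dots[docno] += q_freq * d_freq
    return {d: dot / (q_norm * doc_norms[d]) for d, dot in dots.items()}


def rank(scores):
    """Sort by descending score, breaking ties by ascending docno."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus, already tokenized.
TOY_DOCS = {
    "D1": ["query", "reformulation", "query"],
    "D2": ["information", "retrieval"],
    "D3": ["query", "logs"],
}
POSTINGS, DOC_NORMS = build_index(TOY_DOCS)
QUERY = ["query", "reformulation"]
TFIDF = tfidf_scores(QUERY, POSTINGS, len(TOY_DOCS))
COSINE = cosine_scores(QUERY, POSTINGS, DOC_NORMS)
print(rank(TFIDF))
print(rank(COSINE))
```

Only documents that appear in at least one query term's posting list receive a score, so no full scan of the collection is needed. The document norm is the one quantity that depends on every word in a document, which is why this sketch stores it alongside the index.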

After you have implemented the models, simply run the starter code's main function. It will retrieve the results for the query "query reformulation" and write the search results of the three models to three different files (named results BooleanAND, results TFIDF, and results VSMCosine). It will also print the top 10 results of the three models to the console. Submit your implementation and the three search result files to Moodle. You also need to include the console output (the top 10 results for each model) in your report.

3 Evaluation (25 points)

We performed small-scale relevance judgments for the search query "query reformulation" on the corpus. We stored the judgments in a file named qrels, which you can find in HW1's GitHub repository. Each line of the file stores the docno of a search result and its relevance score by our assessment (1 indicates relevant and 0 indicates not relevant). Use Evaluation.java as the starter code and implement the following three effectiveness evaluation measures. Simply treat unjudged results as not relevant (score 0).

- Precision at rank k (P@k): the precision of the top k search results. (5 points)

  $$P@k = \frac{r_k}{k}$$

  r_k: the number of relevant results among the top k results.

- Recall at rank k (Recall@k): the recall of the top k search results. (5 points)

  $$\mathrm{Recall}@k = \frac{r_k}{|R|}$$

  r_k: the number of relevant results among the top k results.
  |R|: the total number of relevant results in qrels.

- Average Precision (AP). You should measure the average precision of the whole ranked list (counting all retrieved search results). (10 points)

  $$AP = \frac{1}{|R|} \sum_{i=1}^{n} \mathrm{relevance}(i) \cdot P@i$$

  n: the total number of results in the ranked list.
  relevance(i): the relevance score (1 or 0) of the i-th result.
  |R|: the total number of relevant results in qrels.

After you have implemented the three effectiveness measures, you can run Evaluation.java's main function. It will print some evaluation results for the three search result files to the console. Include the reported evaluation information in your report and discuss the effectiveness of the three retrieval models based on the evaluation results. (5 points)

A A Checklist for Your Submission

- A report including the answers to all questions (in PDF).
- All your source code. Write a brief readme file describing the purpose of each program if necessary.
- The three search result files.

Pack everything into a .zip or .tar.gz file and upload it to Moodle's HW1 submission link (before the deadline!). If you decide to use the 5-day extension (you can only use it once during the whole semester), send an email to let the instructor and the TA know.
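As a sanity check for the three measures defined in Section 3, the following Python sketch computes them on a hand-made ranked list. It is an illustration under stated assumptions, not a replacement for Evaluation.java: `qrels` is assumed to be already loaded into a dictionary from docno to 0 or 1, unjudged documents count as 0, and the function names and toy data are hypothetical.

```python
def relevant_in_top_k(ranked, qrels, k):
    """r_k: the number of relevant results among the top k results."""
    return sum(qrels.get(docno, 0) for docno in ranked[:k])


def precision_at_k(ranked, qrels, k):
    """P@k = r_k / k (k stays in the denominator even for short lists)."""
    return relevant_in_top_k(ranked, qrels, k) / k


def recall_at_k(ranked, qrels, k):
    """Recall@k = r_k / |R|, where |R| counts relevant docs in qrels."""
    total_relevant = sum(qrels.values())
    if total_relevant == 0:
        return 0.0
    return relevant_in_top_k(ranked, qrels, k) / total_relevant


def average_precision(ranked, qrels):
    """AP = (1 / |R|) * sum over ranks i of relevance(i) * P@i."""
    total_relevant = sum(qrels.values())
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for i, docno in enumerate(ranked, start=1):
        if qrels.get(docno, 0) == 1:
            hits += 1                  # hits now equals r_i
            precision_sum += hits / i  # P@i at a relevant rank
    return precision_sum / total_relevant


# Hypothetical judgments and ranked list; D5 is unjudged, D4 is never retrieved.
QRELS = {"D1": 1, "D2": 0, "D3": 1, "D4": 1}
RANKED = ["D1", "D2", "D3", "D5"]

print(precision_at_k(RANKED, QRELS, 2))
print(recall_at_k(RANKED, QRELS, 3))
print(average_precision(RANKED, QRELS))
```

In this toy case the relevant document D4 is never retrieved, so it contributes nothing to the sum but still counts in |R|, which is what keeps AP below 1 for an incomplete ranked list.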

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12

Fall CS646: Information Retrieval. Lecture 2 - Introduction to Search Result Ranking. Jiepu Jiang University of Massachusetts Amherst 2016/09/12 Fall 2016 CS646: Information Retrieval Lecture 2 - Introduction to Search Result Ranking Jiepu Jiang University of Massachusetts Amherst 2016/09/12 More course information Programming Prerequisites Proficiency

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis Due by 11:59:59pm on Tuesday, March 16, 2010 This assignment is based on a similar assignment developed at the University of Washington. Running

More information

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags Hadi Amiri 1,, Yang Bao 2,, Anqi Cui 3,,*, Anindya Datta 2,, Fang Fang 2,, Xiaoying Xu 2, 1 Department of Computer Science, School

More information

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval

Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval Basic Tokenizing, Indexing, and Implementation of Vector-Space Retrieval 1 Naïve Implementation Convert all documents in collection D to tf-idf weighted vectors, d j, for keyword vocabulary V. Convert

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

Lecture 5: Information Retrieval using the Vector Space Model

Lecture 5: Information Retrieval using the Vector Space Model Lecture 5: Information Retrieval using the Vector Space Model Trevor Cohn (tcohn@unimelb.edu.au) Slide credits: William Webber COMP90042, 2015, Semester 1 What we ll learn today How to take a user query

More information

UMass at TREC 2017 Common Core Track

UMass at TREC 2017 Common Core Track UMass at TREC 2017 Common Core Track Qingyao Ai, Hamed Zamani, Stephen Harding, Shahrzad Naseri, James Allan and W. Bruce Croft Center for Intelligent Information Retrieval College of Information and Computer

More information

Information Retrieval. (M&S Ch 15)

Information Retrieval. (M&S Ch 15) Information Retrieval (M&S Ch 15) 1 Retrieval Models A retrieval model specifies the details of: Document representation Query representation Retrieval function Determines a notion of relevance. Notion

More information

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback Ameer Albahem ameer.albahem@rmit.edu.au Lawrence Cavedon lawrence.cavedon@rmit.edu.au Damiano

More information

Where Should the Bugs Be Fixed?

Where Should the Bugs Be Fixed? Where Should the Bugs Be Fixed? More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports Presented by: Chandani Shrestha For CS 6704 class About the Paper and the Authors Publication

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Estimating Embedding Vectors for Queries

Estimating Embedding Vectors for Queries Estimating Embedding Vectors for Queries Hamed Zamani Center for Intelligent Information Retrieval College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01003 zamani@cs.umass.edu

More information

Verbose Query Reduction by Learning to Rank for Social Book Search Track

Verbose Query Reduction by Learning to Rank for Social Book Search Track Verbose Query Reduction by Learning to Rank for Social Book Search Track Messaoud CHAA 1,2, Omar NOUALI 1, Patrice BELLOT 3 1 Research Center on Scientific and Technical Information 05 rue des 03 frères

More information

Homework Assignment #3

Homework Assignment #3 CS 540-2: Introduction to Artificial Intelligence Homework Assignment #3 Assigned: Monday, February 20 Due: Saturday, March 4 Hand-In Instructions This assignment includes written problems and programming

More information

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track

Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Bi-directional Linkability From Wikipedia to Documents and Back Again: UMass at TREC 2012 Knowledge Base Acceleration Track Jeffrey Dalton University of Massachusetts, Amherst jdalton@cs.umass.edu Laura

More information

Multimedia Information Systems

Multimedia Information Systems Multimedia Information Systems Samson Cheung EE 639, Fall 2004 Lecture 6: Text Information Retrieval 1 Digital Video Library Meta-Data Meta-Data Similarity Similarity Search Search Analog Video Archive

More information

doi: / _32

doi: / _32 doi: 10.1007/978-3-319-12823-8_32 Simple Document-by-Document Search Tool Fuwatto Search using Web API Masao Takaku 1 and Yuka Egusa 2 1 University of Tsukuba masao@slis.tsukuba.ac.jp 2 National Institute

More information

Index Compression. David Kauchak cs160 Fall 2009 adapted from:

Index Compression. David Kauchak cs160 Fall 2009 adapted from: Index Compression David Kauchak cs160 Fall 2009 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture5-indexcompression.ppt Administrative Homework 2 Assignment 1 Assignment 2 Pair programming?

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Results Exercise 01 Exercise 02 Retrieval

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Query Expansion for Noisy Legal Documents

Query Expansion for Noisy Legal Documents Query Expansion for Noisy Legal Documents Lidan Wang 1,3 and Douglas W. Oard 2,3 1 Computer Science Department, 2 College of Information Studies and 3 Institute for Advanced Computer Studies, University

More information

On Duplicate Results in a Search Session

On Duplicate Results in a Search Session On Duplicate Results in a Search Session Jiepu Jiang Daqing He Shuguang Han School of Information Sciences University of Pittsburgh jiepu.jiang@gmail.com dah44@pitt.edu shh69@pitt.edu ABSTRACT In this

More information

Turnitin assignments are added from the course s home page. To open the course home page, click on the course from the Moodle start page.

Turnitin assignments are added from the course s home page. To open the course home page, click on the course from the Moodle start page. Guides.turnitin.com Turnitin Assignment Assignment Submission Dates Submitting Papers on Behalf of Students Viewing the Turnitin Submission Inbox Updating a Turnitin Assignment 2 Turnitin Students Tab

More information

Northeastern University in TREC 2009 Million Query Track

Northeastern University in TREC 2009 Million Query Track Northeastern University in TREC 2009 Million Query Track Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, Stefan Savev, Javed Aslam Information Studies Department, University of Sheffield, Sheffield, UK College

More information

modern database systems lecture 4 : information retrieval

modern database systems lecture 4 : information retrieval modern database systems lecture 4 : information retrieval Aristides Gionis Michael Mathioudakis spring 2016 in perspective structured data relational data RDBMS MySQL semi-structured data data-graph representation

More information

Natural Language Processing

Natural Language Processing Natural Language Processing Information Retrieval Potsdam, 14 June 2012 Saeedeh Momtazi Information Systems Group based on the slides of the course book Outline 2 1 Introduction 2 Indexing Block Document

More information

Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark Due: Sept. 27 Wednesday 5:00PM Submission: via Canvas, individual submission Instructor: Sangmi Lee Pallickara Web page:

More information

Programming Assignment 1

Programming Assignment 1 CS 276 / LING 286 Spring 2017 Programming Assignment 1 Due: Thursday, April 20, 2017 at 11:59pm Overview In this programming assignment, you will be applying knowledge that you have learned from lecture

More information

Information Retrieval

Information Retrieval Natural Language Processing SoSe 2015 Information Retrieval Dr. Mariana Neves June 22nd, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline Introduction Indexing Block 2 Document Crawling Text Processing

More information

Overview. Lab 2: Information Retrieval. Assignment Preparation. Data. .. Fall 2015 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Overview. Lab 2: Information Retrieval. Assignment Preparation. Data. .. Fall 2015 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. .. Fall 2015 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Due date: Thursday, October 8. Lab 2: Information Retrieval Overview In this assignment you will perform a number of Information

More information

Writeup for first project of CMSC 420: Data Structures Section 0102, Summer Theme: Threaded AVL Trees

Writeup for first project of CMSC 420: Data Structures Section 0102, Summer Theme: Threaded AVL Trees Writeup for first project of CMSC 420: Data Structures Section 0102, Summer 2017 Theme: Threaded AVL Trees Handout date: 06-01 On-time deadline: 06-09, 11:59pm Late deadline (30% penalty): 06-11, 11:59pm

More information

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL

CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LVII, Number 4, 2012 CLUSTERING, TIERED INDEXES AND TERM PROXIMITY WEIGHTING IN TEXT-BASED RETRIEVAL IOAN BADARINZA AND ADRIAN STERCA Abstract. In this paper

More information

CS-E4420 Information Retrieval

CS-E4420 Information Retrieval CS-E4420 Information Retrieval Course assignments 02-07-2017 Esko Ikkala Agenda General information about the course assignments Short demo: how to set up the necessary programming tools Assignments There

More information

Data Structure and Algorithm Homework #3 Due: 2:20pm, Tuesday, April 9, 2013 TA === Homework submission instructions ===

Data Structure and Algorithm Homework #3 Due: 2:20pm, Tuesday, April 9, 2013 TA   === Homework submission instructions === Data Structure and Algorithm Homework #3 Due: 2:20pm, Tuesday, April 9, 2013 TA email: dsa1@csientuedutw === Homework submission instructions === For Problem 1, submit your source code, a Makefile to compile

More information

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks

University of Amsterdam at INEX 2010: Ad hoc and Book Tracks University of Amsterdam at INEX 2010: Ad hoc and Book Tracks Jaap Kamps 1,2 and Marijn Koolen 1 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Faculty of Science,

More information

Term Frequency Normalisation Tuning for BM25 and DFR Models

Term Frequency Normalisation Tuning for BM25 and DFR Models Term Frequency Normalisation Tuning for BM25 and DFR Models Ben He and Iadh Ounis Department of Computing Science University of Glasgow United Kingdom Abstract. The term frequency normalisation parameter

More information

Annotated Suffix Trees for Text Clustering

Annotated Suffix Trees for Text Clustering Annotated Suffix Trees for Text Clustering Ekaterina Chernyak and Dmitry Ilvovsky National Research University Higher School of Economics Moscow, Russia echernyak,dilvovsky@hse.ru Abstract. In this paper

More information

CS 6320 Natural Language Processing

CS 6320 Natural Language Processing CS 6320 Natural Language Processing Information Retrieval Yang Liu Slides modified from Ray Mooney s (http://www.cs.utexas.edu/users/mooney/ir-course/slides/) 1 Introduction of IR System components, basic

More information

Query Expansion using Wikipedia and DBpedia

Query Expansion using Wikipedia and DBpedia Query Expansion using Wikipedia and DBpedia Nitish Aggarwal and Paul Buitelaar Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway firstname.lastname@deri.org

More information

SafeAssign. Blackboard Support Team V 2.0

SafeAssign. Blackboard Support Team V 2.0 SafeAssign By Blackboard Support Team V 2.0 1111111 Contents Introduction... 3 How it works... 3 How to use SafeAssign in your Assignment... 3 Supported Files... 5 SafeAssign Originality Reports... 5 Access

More information

Data Structure and Algorithm Homework #5 Due: 2:00pm, Thursday, May 31, 2012 TA === Homework submission instructions ===

Data Structure and Algorithm Homework #5 Due: 2:00pm, Thursday, May 31, 2012 TA   === Homework submission instructions === Data Structure and Algorithm Homework #5 Due: 2:00pm, Thursday, May 1, 2012 TA email: dsa1@csie.ntu.edu.tw === Homework submission instructions === For Problem 1, submit your source code, a shell script

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

CS54701: Information Retrieval

CS54701: Information Retrieval CS54701: Information Retrieval Basic Concepts 19 January 2016 Prof. Chris Clifton 1 Text Representation: Process of Indexing Remove Stopword, Stemming, Phrase Extraction etc Document Parser Extract useful

More information

A Text Retrieval Approach to Recover Links among s and Source Code Classes

A Text Retrieval Approach to Recover Links among  s and Source Code Classes 318 A Text Retrieval Approach to Recover Links among E-Mails and Source Code Classes Giuseppe Scanniello and Licio Mazzeo Universitá della Basilicata, Macchia Romana, Viale Dell Ateneo, 85100, Potenza,

More information

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _

Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ COURSE DELIVERY PLAN - THEORY Page 1 of 6 Department of Computer Science and Engineering B.E/B.Tech/M.E/M.Tech : B.E. Regulation: 2013 PG Specialisation : _ LP: CS6007 Rev. No: 01 Date: 27/06/2017 Sub.

More information

Carnegie Mellon University Database Applications Fall 2009, Faloutsos Assignment 5: Indexing (DB-internals) Due: 10/8, 1:30pm, only

Carnegie Mellon University Database Applications Fall 2009, Faloutsos Assignment 5: Indexing (DB-internals) Due: 10/8, 1:30pm,  only Carnegie Mellon University 15-415 - Database Applications Fall 2009, Faloutsos Assignment 5: Indexing (DB-internals) Due: 10/8, 1:30pm, e-mail only 1 Reminders Weight: 20% of the homework grade. Out of

More information

INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ

INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ 1. Click the Assessment link on the navbar. 2. Scroll down and select Dropbox. 3. Click New Folder. 4. In the new folder

More information

Retrieval Evaluation

Retrieval Evaluation Retrieval Evaluation - Reference Collections Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, Chapter

More information

Session 10: Information Retrieval

Session 10: Information Retrieval INFM 63: Information Technology and Organizational Context Session : Information Retrieval Jimmy Lin The ischool University of Maryland Thursday, November 7, 23 Information Retrieval What you search for!

More information

A short introduction to the development and evaluation of Indexing systems

A short introduction to the development and evaluation of Indexing systems A short introduction to the development and evaluation of Indexing systems Danilo Croce croce@info.uniroma2.it Master of Big Data in Business SMARS LAB 3 June 2016 Outline An introduction to Lucene Main

More information

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015

University of Virginia Department of Computer Science. CS 4501: Information Retrieval Fall 2015 University of Virginia Department of Computer Science CS 4501: Information Retrieval Fall 2015 5:00pm-6:15pm, Monday, October 26th Name: ComputingID: This is a closed book and closed notes exam. No electronic

More information

CS 341 Spring 2014 HW #8. Policy: Individual *or* teams of 2 electronic via Blackboard. Assignment. N-tier Design

CS 341 Spring 2014 HW #8. Policy: Individual *or* teams of 2 electronic via Blackboard. Assignment. N-tier Design HW #8 Complete By: Wednesday April 9 th @ 9:00pm Policy: Individual *or* teams of 2 Submission: electronic via Blackboard CS 341 Spring 2014 Assignment The previous homework (HW7) focused on building a

More information

INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ

INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ INSTRUCTOR - ONQ - ADD A TURNITIN DROPBOX TO CREATE A TURNITIN DROPBOX FOLDER IN ONQ 1. Click the Assessment link on the navbar. 2. Scroll down and select Dropbox. 3. Click New Folder. 4. In the new folder

More information

Sheffield University and the TREC 2004 Genomics Track: Query Expansion Using Synonymous Terms
