Building a Search Engine: Part 2

Brown University CSCI 1580, Spring 2013
Due: 2:00pm, 8 Mar 2013

Overview

In this part of the Course Project you will enhance the simplified search engine that you built in Part 1. First, you will add support for wildcard queries. Then, given a query, you will output a set of pages that answer it, sorted according to a relevance estimate based on a vector space ranking model. Please read the whole document carefully before you start coding. If you are in doubt, ask the TAs.

1 Specification

You will extend the two programs (one for building the inverted index and one for answering queries using this index) that you coded in Part 1.

1.1 Building the Inverted Index

As in Part 1, your first program, createindex, will build a positional inverted index from a collection of documents.

1.1.1 Parsing the collection

You will adopt the same parsing rules that were described in Section 1.1.1 of the specification for Part 1.

1.1.2 Inverted index format

You can decide the on-disk format of your inverted index; it should be contained in a single file, as in Part 1. After reading only the inverted index, you must be able to obtain the postings list associated with a term, with its position(s) within each document. Moreover, in view of the query procedure detailed in Section 1.2.4, you might want to clearly separate the dictionary of terms from the postings lists (so you can read only the former) and store the term weights in the index (instead of computing them at execution time).

1.1.3 Title index format

You should also produce a title index. You can decide its on-disk format, provided that you store the id and the original title of each page. Your title index must be contained in a single file.
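As a rough illustration of the pieces described above, here is a minimal Python sketch of one possible createindex flow: it builds positional postings, keeps the dictionary separate from the postings lists (storing only byte offsets), and writes everything into a single index file plus a title index file. The tokenizer, the stem() stub, and the pickle-based layout are placeholder assumptions, not the required Part 1 parsing rules or a mandated format.

    # Sketch only: illustrative createindex flow, not the required format.
    import pickle, re
    from collections import defaultdict

    def stem(word):
        # Placeholder: plug in the Porter stemmer referenced in Section 2.
        return word.lower()

    def build_index(pages, stopwords):
        """pages: iterable of (doc_id, title, text); stopwords: set of lowercase words.
        Returns postings[term][doc_id] -> list of token positions, and titles."""
        postings = defaultdict(lambda: defaultdict(list))
        titles = {}
        for doc_id, title, text in pages:
            titles[doc_id] = title
            tokens = re.findall(r"\w+", text)      # stand-in for the Part 1 parsing rules
            for pos, tok in enumerate(tokens):
                if tok.lower() in stopwords:
                    continue
                postings[stem(tok)][doc_id].append(pos)
        return postings, titles

    def write_index(postings, titles, index_path, titles_path):
        # Single index file: postings lists first, then a dictionary mapping
        # term -> byte offset, then an 8-byte footer pointing at the dictionary,
        # so queryindex can load the dictionary alone and seek() to each list.
        with open(index_path, "wb") as f:
            dictionary = {}
            for term, plist in postings.items():
                dictionary[term] = f.tell()
                pickle.dump(dict(plist), f)
            dict_offset = f.tell()
            pickle.dump(dictionary, f)
            f.write(dict_offset.to_bytes(8, "big"))
        with open(titles_path, "wb") as f:
            pickle.dump(titles, f)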

1.1.4 Input/Output

You will submit a bash script createindex.sh, suitably invoking your createindex program. Your bash script should accept four parameters, representing names of files:

- the list of stop words;
- the collection of pages;
- the inverted index to be built;
- the title index to be built.

To test your program, we will invoke something like this:

$ ./createindex.sh mystopwords.dat mycollection.dat myindex.dat mytitles.dat

1.2 Ranking of Search Results

Your program queryindex will accept queries and output an ordered list of documents that answer each query, sorted using the procedure described below.

1.2.1 Query types and query parsing

Your program will handle the four query types from Project Part 1 and two new types of queries:

- One word query (OWQ): a single word (e.g., Spartacus);
- Free text query (FTQ): a sequence of at least two words, separated by spaces (e.g., Orange Clockwork or Eyes Wide Shut);
- Phrase query (PQ): a sequence of at least two words, separated by spaces and enclosed in double quotes (e.g., "Paths of Glory" or "Barry Lindon");
- Boolean query (BQ): any expression generated by the following grammar:

  boolquery → word | ( boolquery ) | boolquery AND boolquery | boolquery OR boolquery

  (e.g., Full OR Metal, 2001 AND Odyssey AND Space, (Fear AND Desire) OR Shining, ((Lolita OR Strangelove) AND (Start AND Stop)));
- Wildcard query (WQ): a single word containing at least one asterisk (e.g., Oran* or Spar*c*);
- Wildcard phrase query (WPQ): similar to PQs, but allowing wildcard words (e.g., "Path* Glory" or "Bar* Lin*d*").
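For illustration, the following Python sketch classifies a raw query into one of the six types and parses BQs with a small recursive-descent parser matching the grammar and the precedence rules described below. It assumes well-formed queries, as the specification allows; the names classify, tokenize_bool and parse_bool are ours, and Python users may instead rely on the bool_parser library mentioned in Section 2.

    # Sketch only: query classification and a recursive-descent BQ parser.
    import re

    def classify(query):
        if "AND" in query.split() or "OR" in query.split() or "(" in query:
            return "BQ"
        if query.startswith('"') and query.endswith('"'):
            return "WPQ" if "*" in query else "PQ"
        if "*" in query:
            return "WQ"
        return "OWQ" if len(query.split()) == 1 else "FTQ"

    def tokenize_bool(query):
        # Split out parentheses; AND/OR are recognised only when capitalised.
        return re.findall(r"\(|\)|[^\s()]+", query)

    def parse_bool(tokens):
        """Grammar with the stated precedence:  or  := and (OR and)*
                                                and := atom (AND atom)*
                                                atom := '(' or ')' | word"""
        pos = [0]

        def peek():
            return tokens[pos[0]] if pos[0] < len(tokens) else None

        def eat():
            tok = tokens[pos[0]]; pos[0] += 1; return tok

        def atom():
            if peek() == "(":
                eat(); node = or_expr(); eat()   # consume ')'
                return node
            return ("WORD", eat())

        def and_expr():
            node = atom()
            while peek() == "AND":
                eat(); node = ("AND", node, atom())
            return node

        def or_expr():
            node = and_expr()
            while peek() == "OR":
                eat(); node = ("OR", node, and_expr())
            return node

        return or_expr()

    # Example: parse_bool(tokenize_bool("(Fear AND Desire) OR Shining"))
    # -> ('OR', ('AND', ('WORD', 'Fear'), ('WORD', 'Desire')), ('WORD', 'Shining'))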

Please note that "word" here means any string of alphanumeric or punctuation characters (excluding double quotes, reserved for PQs; parentheses, reserved for BQs; and asterisks, reserved for WQs) terminated by a blank space: this definition allows the user to input queries like "2001: A Space Odyssey" or Killer's Kiss. Only capitalized AND and OR act as operators for BQs; that is, you can assume that if a query contains AND or OR, they represent the logic operators. The precedence order is the standard one, from highest to lowest: first ( and ), then AND, then OR. (Note that x OR y AND z must be parsed as x OR (y AND z).) You can assume that PQs, BQs and WPQs are well-formed. In case a query is composed of a single word, you should consider it an OWQ (or an FTQ, as they are equivalent in this case) rather than a BQ.

PQs differ from FTQs because they require each matching document to contain the terms corresponding to the words specified by the query, one next to the other and in the same order. BQs require the returned documents to contain terms that satisfy the boolean condition. The wildcard character (*) is expected to match any string of alphanumeric characters (hence terminated by a non-alphanumeric character, like a space or a punctuation character). For example, Bomb* matches Bomb, Bombs, Bombardier or Bomb2011. To simplify your work, you can assume that each wildcard word contains at most two wildcard characters, and that they will not be contiguous (i.e., you might deal with Lol*t* but not with Strangel**ve).

1.2.2 Ranking documents

After parsing a query Q, you will have its type T and a stream of k terms t_1, t_2, ..., t_k, which are the stemmed versions of the words contained in the query. You should first determine the set of documents D_Q matching the given query Q. This should be done according to the type of the query:

- if T is an OWQ, then D_Q is the set of documents containing at least one word whose stemmed version is t_1;
- if T is an FTQ, then D_Q is the set of documents containing at least one word whose stemmed version is one of the t_i's;
- if T is a PQ, then D_Q is the set of documents containing words which, after the removal of stop words and after being stemmed, produce the subsequence t_1, t_2, ..., t_k in this order and with adjacent terms;
- if T is a BQ, then D_Q is the set of documents containing words which, after being stemmed, satisfy the boolean condition;
- if T is a WQ, then D_Q is the set of documents containing at least one match for the wildcard word;
- if T is a WPQ, then D_Q is as for a PQ, but allowing wildcard matches.

The last step is ranking the documents of D_Q by their relevance to the query Q: you will write a procedure relying on the vector space model. Specifically, you must use the algorithm CosineScore of Figure 6.14 of the textbook, where the parameter q is the stream of terms t_1, t_2, ..., t_k mentioned above.
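The following Python sketch shows one way the scoring loop could look, in the spirit of the textbook's CosineScore: it accumulates wf_{t,d} * w_{t,q} per document and returns the ids of the top 10 documents. It assumes the weights defined in the next subsection (Euclidean-normalised wf_{t,d} stored in the index, and w_{t,q} = idf_t), so no extra length normalisation is applied at query time; fetch_weights and idf are placeholders for your own index-access code.

    # Sketch only: accumulate-and-rank scoring in the spirit of CosineScore.
    from collections import defaultdict
    import heapq

    def cosine_score(query_terms, fetch_weights, idf, k=10):
        """query_terms: stemmed terms t_1..t_k from the parsed query.
        fetch_weights(t): dict doc_id -> wf_{t,d} read from the index.
        idf(t): idf_t in the collection."""
        scores = defaultdict(float)
        for t in query_terms:
            w_tq = idf(t)
            for doc_id, wf_td in fetch_weights(t).items():
                scores[doc_id] += wf_td * w_tq
        # Keep only documents with a non-zero score, top k by decreasing score.
        ranked = heapq.nlargest(k, ((s, d) for d, s in scores.items() if s > 0))
        return [d for s, d in ranked]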

You will weight terms within documents and queries as illustrated in Example 6.4 of the textbook. The weight of a term t in a document (page) d is its tf, with Euclidean normalization over term frequencies:

    wf_{t,d} = tf_{t,d} / sqrt( sum_{t' in T_d} tf_{t',d}^2 ),

where T_d is the set of all terms appearing in document d. The weight of a term t appearing in a query q is its idf in the collection:

    w_{t,q} = idf_t = log( N / df_t ).

1.2.3 Weights for Wildcard Terms

If the query contains wildcard words, you have to determine the weights as follows. Let T be the set of terms that match a wildcard word w. You need to consider the set D_T = union of D_t over all t in T, that is, the documents which contain at least one term from T. The weight of w for a document d in D_T is defined as:

    wf_{w,d} = max_{t in T} wf_{t,d}.

For a query, you will use as the weight of w the highest idf among all the terms in T, i.e.,

    w_{w,q} = max_{t in T} idf_t.

Example: let our dictionary contain the terms Manet and Monet, with the following term idf's:

    idf_Manet = 3,    idf_Monet = 2,

and term-document wf's:

    wf_{Manet,0} = 0,      wf_{Monet,0} = 0.5,
    wf_{Manet,1} = 1.5,    wf_{Monet,1} = 0.2.

Given a query q = M*net, we will adopt the following weights:

    wf_{M*net,0} = 0.5,    wf_{M*net,1} = 1.5,    w_{M*net,q} = 3.
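To make the wildcard rules concrete, here is a small Python sketch that expands a wildcard word against the dictionary and applies the max rules above; running it on the Manet/Monet data reproduces the weights listed in the example. The regex-based match_terms is only a stand-in for the dictionary structure required in Section 1.2.4.

    # Sketch only: wildcard expansion and the max-based weight rules.
    import re

    def match_terms(wildcard, dictionary):
        # '*' matches any run of alphanumeric characters.
        pattern = re.compile("^" + re.escape(wildcard).replace(r"\*", "[0-9A-Za-z]*") + "$")
        return [t for t in dictionary if pattern.match(t)]

    def wildcard_weights(wildcard, dictionary, wf, idf):
        """wf[t][doc_id] -> wf_{t,d};  idf[t] -> idf_t.
        Returns (per-document weights wf_{w,d}, query weight w_{w,q})."""
        T = match_terms(wildcard, dictionary)
        doc_weights = {}
        for t in T:
            for d, w in wf[t].items():
                doc_weights[d] = max(doc_weights.get(d, 0.0), w)   # wf_{w,d} = max_t wf_{t,d}
        query_weight = max(idf[t] for t in T)                      # w_{w,q} = max_t idf_t
        return doc_weights, query_weight

    # The example from the text:
    idf = {"Manet": 3, "Monet": 2}
    wf = {"Manet": {0: 0.0, 1: 1.5}, "Monet": {0: 0.5, 1: 0.2}}
    print(wildcard_weights("M*net", ["Manet", "Monet"], wf, idf))
    # -> ({0: 0.5, 1: 1.5}, 3)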

1.2.4 Input/Output

You will submit a bash script queryindex.sh, suitably invoking your queryindex program. Your bash script should accept three parameters, representing names of files:

- the list of stop words;
- the inverted index;
- the title index.

Your program will read queries from the standard input, one at a time, and output to the standard output the list of the id's of the 10 most relevant documents. After retrieving and outputting the relevant documents for a query, your program will wait for another query, until the user closes the standard input with CTRL+D. Please note that your program might be run in an interactive way: reading all the queries first and then producing all the results is not acceptable.

More precisely, you should output the id's of the 10 documents with the highest non-zero score, in decreasing order of score (highest first), separated by a blank space. If there are fewer than 10 relevant documents, output them all. If there is no relevant document, output a blank line. For example, the following might be a typical execution:

$ ./queryindex.sh mystopwords.dat myindex.dat mytitles.dat
Space Odyssey
10 42 12 34 56 14 89 0 13 1
Space AND Odyssey
10 12 14 0 1
Odyssey AND Space
10 12 14 0 1
"2001: A Space Odyssey"
10 12 14
"2001: A Space Od*"
10 12 14 0
Titan*

Full Metal Jacket
42 45 54 12 88 89 67 81 9 6

Note that the I/O interface specified above allows us to use redirection, as in the following example:

$ ./queryindex.sh mystopwords.dat myindex.dat mytitles.dat < myqueries.dat > mydocs.dat

The number of lines of myqueries.dat and mydocs.dat should be the same.

For this assignment (and the subsequent ones), your queryindex program is not allowed to read the whole inverted index into memory. Instead, you can only load the dictionary before starting to answer queries, and then you must fetch the relevant postings lists directly from disk. Loading the dictionary should take no more than 3 minutes (for the full collection), and you should use a tree structure (e.g., a B-tree) to manipulate the dictionary once it is in memory. In particular, you should be able to handle wildcards efficiently. If you code in Python, you can use the /course/cs158/src/lib/btrees library. Your program should answer each query in less than 10 seconds, including fetching the relevant postings lists from disk. (The 10 seconds rule applies to reasonable queries, like the sample ones that we will provide. Of course, we will not expect your code to answer queries like a* in such a short time!)
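A minimal sketch of the interactive loop described above: it reads one query per line from standard input until CTRL+D, prints at most 10 ids (or a blank line) per query, and flushes after each answer so it behaves the same interactively and under redirection. answer_query is a placeholder for the parsing and ranking pipeline.

    # Sketch only: the interactive I/O contract of queryindex.
    import sys

    def query_loop(answer_query):
        for line in sys.stdin:                     # ends on CTRL+D / end of file
            query = line.strip()
            doc_ids = answer_query(query)[:10] if query else []
            # One output line per input line: ids separated by spaces, or blank.
            print(" ".join(str(d) for d in doc_ids), flush=True)

    if __name__ == "__main__":
        query_loop(lambda q: [])                   # trivial stand-in: never finds anything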

2 Data

For this assignment you will need the following files, located in /course/cs158/data/part2/:

- testcollection.dat: the test collection, containing roughly 10% of the full collection (33 MB);
- fullcollection.dat: the full collection (warning: 340 MB!);
- queries: directory with some sample queries on the full collection;
- stopwords.dat: the list of stop words;
- createindex.sh: sample Bash script to invoke your createindex program;
- queryindex.sh: sample Bash script to invoke your queryindex program;
- readme.txt: a plain text file you should fill in with your information.

You can find an implementation of the Porter stemming algorithm at http://tartarus.org/~martin/PorterStemmer/. Moreover, if you are coding in Python, you can use our parser of boolean expressions, which you can find in /course/cs158/src/lib/bool_parser.

To assist you in checking that your programs run correctly, we will provide you with the expected results for some of the queries on the collection. We will release this expected output a few days after the publication of this specification, announcing the details via email.

3 Report

Please include a report in PDF format answering the questions below. When analyzing your data structures and algorithms, please adopt big-O notation instead of generic "very long" or "very big", parameterizing your analysis according to the relevant quantities (number of documents/terms/etc.). Please make sure that your answer is within the sentence limit; we do not expect more than that, and less is fine as long as it is complete. Be concrete, precise and concise. Please do not answer questions that we do not ask! If you are not sure, ask the TAs.

1. Describe the format of your inverted index, justifying your choice. What is its total size on disk for the given collection of documents, in absolute terms (number of bytes) and as a percentage of the size of the original collection file? (5 sentences)

2. What is the time and space complexity of your createindex program? Explain your answer. (5 sentences)

3. At execution time, when and how does your queryindex program access the inverted index? (3 sentences)

4. Describe the data structures that your queryindex program uses for manipulating the dictionary (once it is in memory), specifying how you use them to answer wildcard queries. What is the time complexity of answering a wildcard query (without taking the ranking into account)? (7 sentences)

5. Try your search engine on the given collection with your own queries, a few of each type. Which type of queries does it perform best/worst on, in terms of result relevance? How does this depend on the query words (common/uncommon/specific/etc.)? Provide at least 2 examples for each query type, one positive and one negative.

4 Collaboration Policy

You can work on the Course Project in pairs, with the same team-mate with whom you worked on Part 1 of the Course Project. Only one submission per team is required. Please email cs158tas at cs dot brown dot edu if you have any questions.

5 Submission

Please copy all the files mentioned below into a separate directory, cd there, and run the handin script from that directory. Be aware that the handin script recursively copies all the files from the directory it is executed from: if you run it from, say, your home directory, it will hand in all your files and directories! The directory from which you run the handin script should contain only:

- Any source code you wrote and any external module (e.g., .py/.pyc) you need to run your code.
- The two Bash scripts createindex.sh and queryindex.sh, suitably modified to call your programs.
- The plain text file readme.txt, filled in with the required information. (Please do not edit the structure of this file (i.e., FULL_NAME, etc.), just substitute our data with yours!)
- Your report, in PDF format, named report.pdf.

Please submit only the above files (do not submit your index, the collection, or your output!) using the following command:

$ /course/cs158/bin/cs158-handin part2

We suggest you create an empty directory, copy all the above files there, and then create a copy of the entire directory. Move to the latter and try to execute your bash scripts, in order to check that you have not forgotten to include any files necessary for the execution of your programs. If your programs run smoothly, then you can submit your files from the clean directory.

6 Evaluation

We will run your programs on a CS machine with the full collection of documents, issuing queries that might or might not be in the sample set. We will verify that your programs produce reasonable results, looking at the retrieved documents. We will compare them with an external reference search engine, and we will judge both the relevance of the retrieved documents and the quality of the ranking (i.e., their relative order).

It is mandatory that your programs adhere to the I/O interface described in this specification. In particular, check for unwanted prompts, debugging messages, or newlines that you might have forgotten in your code. It is your responsibility to make sure that your code runs on the department machines and satisfies the naming and usage criteria described in this document. We will only evaluate submissions that respect the I/O specification. Moreover, we expect that your programs will run without any syntax or runtime errors.

Your grade will depend on:

- The algorithms/data structures you adopted.
- The size of the inverted index generated by your createindex program.
- The quality of the rankings produced by your queryindex program and its speed.
- The quality of both your code and report.

Have fun!