Search Engines Exercise 5: Querying. Dustin Lange & Saeedeh Momtazi 9 June 2011

Similar documents
Search Engines Exercise 2: Crawling. Dustin Lange & Saeedeh Momtazi 28 April 2011

A short introduction to the development and evaluation of Indexing systems

Development of Search Engines using Lucene: An Experience

Information Retrieval

Information Retrieval

VK Multimedia Information Systems

Natural Language Processing

Applied Databases. Sebastian Maneth. Lecture 11 TFIDF Scoring, Lucene. University of Edinburgh - February 26th, 2017

Apache Lucene - Overview

EPL660: Information Retrieval and Search Engines Lab 2

Information Retrieval

Query Suggestion. A variety of automatic or semi-automatic query suggestion techniques have been developed

LAB 7: Search engine: Apache Nutch + Solr + Lucene

Indexing and Search with

Lab A: Using Windows 2000 Help

Query Refinement and Search Result Presentation

The Lucene Search Engine

Web Data Management. Text indexing with LUCENE (Nicolas Travers) Philippe Rigaux CNAM Paris & INRIA Saclay

Chapter 2. Architecture of a Search Engine

Apache Lucene 4. Robert Muir

Search Engine Architecture II

Chapter 6. Queries and Interfaces

LucidWorks: Searching with curl October 1, 2012

Project Report on winter

LIST OF ACRONYMS & ABBREVIATIONS

Introduc)on to Lucene. Debapriyo Majumdar Information Retrieval Spring 2015 Indian Statistical Institute Kolkata

Search Evolution von Lucene zu Solr und ElasticSearch. Florian

AN EFFECTIVE SEARCH TOOL FOR LOCATING RESOURCE IN NETWORK

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Information Retrieval

Open Source Search. Andreas Pesenhofer. max.recall information systems GmbH Künstlergasse 11/1 A-1150 Wien Austria

CS6200 Information Retrieval. David Smith College of Computer and Information Science Northeastern University

Tutorial to QuotationFinder_0.4.3

Social Information Filtering

Recitation # Embedded System Engineering Friday 16-Oct-2015

Soir 1.4 Enterprise Search Server

Tutorial to QuotationFinder_0.4.4

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi)

Tutorial to QuotationFinder_0.6

Relevancy Workbench Module. 1.0 Documentation

Chapter 6: Information Retrieval and Web Search. An introduction

Lucene Java 2.9: Numeric Search, Per-Segment Search, Near-Real-Time Search, and the new TokenStream API

CSC116: Introduction to Computing - Java

How A Website Works. - Shobha

Taming Text. How to Find, Organize, and Manipulate It MANNING GRANT S. INGERSOLL THOMAS S. MORTON ANDREW L. KARRIS. Shelter Island

Information Retrieval CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

AntMover 0.9 A Text Structure Analyzer

CSC116: Introduction to Computing - Java

Modern Information Retrieval

Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Question Answering Systems

Termin 6: Web Suche. Übung Netzbasierte Informationssysteme. Arbeitsgruppe. Prof. Dr. Adrian Paschke

CSCI 599: Applications of Natural Language Processing Information Retrieval Retrieval Models (Part 3)"

Computer Science 572 Midterm Prof. Horowitz Tuesday, March 12, 2013, 12:30pm 1:45pm

Project Report. Project Title: Evaluation of Standard Information retrieval system related to specific queries

CCH China Law Express & China Law for Foreign Business. Participant Training Guide

Tangent-3 at the NTCIR-12 MathIR Task

CSCI 320 Group Project

Indexing and Query Processing. What will we cover?

Automated Tagging to Enable Fine-Grained Browsing of Lecture Videos

Digital Libraries: Language Technologies

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

LAB 3: Text processing + Apache OpenNLP

An Axiomatic Approach to IR UIUC TREC 2005 Robust Track Experiments

rpaf ktl Pen Apache Solr 3 Enterprise Search Server J community exp<= highlighting, relevancy ranked sorting, and more source publishing""

LUCENE - FIRST APPLICATION

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

How Does a Search Engine Work? Part 1

More on indexing CE-324: Modern Information Retrieval Sharif University of Technology

Brainspace: Quick Reference

Hibernate Search: A Successful Search, a Happy User Make it Happen!

Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search

Goal of this document: A simple yet effective

Hybrid Approach for Query Expansion using Query Log

ClickLearn Studio. - A technical guide 9/18/2017

: Distributed Systems Principles and Paradigms Assignment 1 Multithreaded Dictionary Server

Indexing: Part IV. Announcements (February 17) Keyword search. CPS 216 Advanced Database Systems

Apache Lucene - Scoring

Natural Language Processing SoSe Question Answering. (based on the slides of Dr. Saeedeh Momtazi) )

Natural Language Processing

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

CS646 (Fall 2016) Homework 1

CSc 2310 Principles of Programming (Java) Jyoti Islam

Updates on IT support for the IPC

Supporting Constructivist Learning in a Multimedia Presentation System

ASE 2017, Urbana-Champaign, IL, USA Technical Research /17 c 2017 IEEE

CS-E4420 Information Retrieval

Approach Research of Keyword Extraction Based on Web Pages Document

Tutorials. Tutorial every Friday at 11:30 AM in Toldo 204 * discuss the next lab assignment

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

Lucene 4 - Next generation open source search

Relevance Feedback & Other Query Expansion Techniques

Information Retrieval

Welcome to the class of Web Information Retrieval!

Seznam.cz Fulltext Architecture

Assignment 3 ITCS-6010/8010: Cloud Computing for Data Analysis

Information Retrieval Exercises

Mining Web Data. Lijun Zhang

YAM++ Results for OAEI 2013

HTM, HTML, MHT, MHTML Web document Brightspace Learning Environment strips the <title> tag and text within the tag from user created web documents

Transcription:

Search Engines Exercise 5: Querying Dustin Lange & Saeedeh Momtazi 9 June 2011

Task 1: Indexing with Lucene We want to build a small search engine for movies Index and query the titles of the 100 best rated movies of all time from IMDb with Lucene

Lucene Components Indexing Directory Where to put the index: RAM / file system Document Entity to be indexed / searched Consists of fields Analyzer Prepare indexed text Tokenization, stemming, stop words, IndexWriter Create and maintain index Searching QueryParser Parse query, return query object IndexSearcher Access index with query object Collector Scoring results (Hits) and result filtering Source: http://blogs.msdn.com/b/windows-azure-support/archive/2010/11/01/how-to-use-lucene-net-in-windows-azure.aspx

Task 1: Indexing with Lucene Download Apache Lucene 3.1.0 http://lucene.apache.org/java/docs/index.html http://www.apache.org/dyn/closer.cgi/lucene/java/ Compile and run the given Java class Change the implementation so that a) Stop words are removed b) Stemming is done c) Suggestions for single-term queries are made (based on the indexed data) d) Suggestions for phrase queries are made (based on the indexed data) Which changes did you make? Briefly explain how your solution works (i.e., how Lucene s components work). How do the results of the given queries improve with each of these changes? (For the suggestions, compare the results of the original query with the results of the best suggestion.) You will need lucene-core, but also additional packages from the contrib folder (analyzers, spellchecker)

Relevance feedback User identifies relevant documents in the initial result list Modifying query using terms from those documents and re-ranks documents Pseudo-relevance feedback Task 2: Positional Pseudorelevance Feedback System assumes all top retrieved documents for initial query are relevant Expanding query based on the words that are co-occurred with query terms in top documents Localized/Positional pseudo-relevance feedback Intuition: words closer to query words are more likely to be related to the query topic Selecting from feedback documents those words that are focused on the query topic based on positions of terms in feedback documents Exploiting term positions and proximity to assign more weights to words closer to query words Co-occurrence counts are measured in a small window of terms around the query words

Task 2: Positional Pseudorelevance Feedback Use one of the following words as a query in google.com and assume the top 20 documents are relevant: Car Camel Golf Find top 15 words that are related to the query based on the positional pseudo-relevance model Compare the list of related words that are extracted based on the proposed variations (next slide)

Task 2: Positional Pseudorelevance Feedback Use window size 10 with the following term association measures: Dice s coefficient Mutual information Pearson s Chi-squared (χ2) measure Use window size 10 with the following localized measures (see next slide for examples): Uniform Weighted Use uniform model and mutual information with the following contexts: Whole document 5 words

Submissions & Next Exercise Submissions: Create slides to present your solution. Send your presentation as PDF or PPT(X) or ODP: SearchEngines5[Name1][Name2].[pdf ppt pptx odp] via e-mail with subject: Search Engines 5 to dustin (dot) lange (at) hpi (dot) until 04 July 2011, 5:00 pm On 05 July 2011 (Tuesday): Be prepared to present your solution English (or German) Absent: Send us an e-mail in advance

Thanks for Listening Updates Mailing list: searchengines2011 (at) hpi ( ) See website Questions Via e-mail: dustin (dot) lange (at) hpi ( ) saeedeh (dot) momtazi (at) hpi ( ) Office: A-1.6 / A-1.7