COMP 4601 Implementing Search using Lucene


COMP 4601 Implementing Search using Lucene

Luke: Lucene index analyzer
WARNING: I HAVE NOT USED THIS

Scenario
Crawler -> Crawl directory containing tokenized content -> Lucene -> Lucene index directory

Classes for Indexing
- FSDirectory
- StandardAnalyzer
- IndexWriterConfig
- IndexWriter
- Document
- Field

Example Context
- Files have been crawled and the important information stored in new files with a TXT extension.
- Only content of interest has been saved and will be used for indexing.
- Your MongoDB code will be different, but the result of a findOne() call is equivalent to a file in this example.
- The code which follows is a SKETCH only.

Basic Algorithm
1. Open an FSDirectory (the index).
2. For each resource (i.e., a MongoDB document):
   - Create a Lucene document.
   - For each field in the Mongo document, create a field in the Lucene document, deciding whether it should be searchable or not.
   - Save the Lucene document.

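Indexing (INDEX_DIR = where I store the index)

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.FSDirectory;

    // dir (FSDirectory) and writer (IndexWriter) are fields of the enclosing
    // class, so the finally block and indexAFile() can both see them.
    try {
        File docDir = new File(CRAWL_DIR);
        dir = FSDirectory.open(new File(INDEX_DIR).toPath());
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE);   // start a new index, replacing any old one
        writer = new IndexWriter(dir, iwc);
        indexDocuments(writer, docDir);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (writer != null) writer.close();   // commits pending documents
            if (dir != null) dir.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }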

Indexing

    // Walks the crawl directory recursively and indexes every readable file.
    private void indexDocuments(IndexWriter writer, File file) {
        if (file.canRead()) {
            if (file.isDirectory()) {
                String[] files = file.list();
                if (files != null) {
                    for (String name : files)
                        indexDocuments(writer, new File(file, name));
                }
            } else {
                // try-with-resources closes the stream even if indexing fails
                try (FileInputStream fis = new FileInputStream(file)) {
                    indexAFile(file, fis);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }

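Indexing

    private void indexAFile(File file, FileInputStream fis) throws IOException {
        Document doc = new Document();
        // StringField: indexed as a single token (not analyzed) and stored
        doc.add(new StringField(PATH, file.getPath(), Field.Store.YES));
        try {
            // Strip the file extension to obtain the document ID
            int docId = Integer.parseInt(file.getName().replaceFirst("[.][^.]+$", ""));
            // IntField was dropped in Lucene 6: IntPoint indexes the value,
            // StoredField keeps it retrievable from search results
            doc.add(new IntPoint(DOC_ID, docId));
            doc.add(new StoredField(DOC_ID, docId));
        } catch (NumberFormatException e) {
            // File name is not a document ID: index the file without DOC_ID
        }
        doc.add(new StoredField(MODIFIED, file.lastModified()));
        // TextField with a Reader: analyzed and indexed, but not stored
        doc.add(new TextField(CONTENTS, new BufferedReader(
                new InputStreamReader(fis, "UTF-8"))));
        writer.addDocument(doc);   // writer is the IndexWriter field
    }

I am assuming that files are named with a document ID. This code removes the file extension (e.g., .xml).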

Classes for Searching
- DirectoryReader
- FSDirectory
- IndexSearcher
- QueryParser
- Query ("my search")
- TopDocs
- ScoreDoc

Classes for Searching (INDEX_DIR = where I store the index)

    public ArrayList<COMP4601Document> query(String searchString) {
        // try-with-resources closes the reader only after getDocs() has
        // finished reading the stored fields
        try (IndexReader reader = DirectoryReader.open(
                FSDirectory.open(new File(INDEX_DIR).toPath()))) {
            searcher = new IndexSearcher(reader);   // field, also used by getDocs()
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser("contents", analyzer);
            Query q = parser.parse(searchString);
            TopDocs results = searcher.search(q, 100);   // at most 100 documents
            ScoreDoc[] hits = results.scoreDocs;
            return getDocs(hits);
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
        return null;   // the search failed
    }

Classes for Searching

    public ArrayList<COMP4601Document> getDocs(ScoreDoc[] hits) throws IOException {
        ArrayList<COMP4601Document> docs = new ArrayList<>();
        for (ScoreDoc hit : hits) {
            // searcher is the IndexSearcher field; its reader must still be open
            Document indexDoc = searcher.doc(hit.doc);
            String id = indexDoc.get(DOC_ID);
            if (id != null) {
                COMP4601Document d = find(Integer.valueOf(id));
                if (d != null) {
                    d.setScore(hit.score);   // used in display to user
                    docs.add(d);
                }
            }
        }
        return docs;
    }

This is a sketch. The class COMP4601Document is used here to differentiate it from the Lucene Document class.

Analyzers
Input: "@Andy52 went to school yesterday!"

StandardAnalyzer    [andy52] [went] [school] [yesterday]
StopAnalyzer        [andy] [went] [school] [yesterday]
SimpleAnalyzer      [andy] [went] [to] [school] [yesterday]
WhitespaceAnalyzer  [@Andy52] [went] [to] [school] [yesterday!]
KeywordAnalyzer     [@Andy52 went to school yesterday!]
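The differences between analyzers come down to where the text is split and how tokens are normalized. A rough plain-Java approximation of three of them is sketched here. It does not use Lucene's analyzer classes: AnalyzerSketch and its methods are hypothetical names, and StandardAnalyzer and StopAnalyzer are left out because they also need Unicode word-break rules and a stop-word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: imitates three analyzers with string operations.
final class AnalyzerSketch {

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return new ArrayList<>();
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Like SimpleAnalyzer: break at every non-letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like KeywordAnalyzer: the whole input becomes a single token.
    static List<String> keyword(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        String s = "@Andy52 went to school yesterday!";
        System.out.println(whitespace(s));
        System.out.println(simple(s));
        System.out.println(keyword(s));
    }
}
```

Whichever analyzer is chosen, the same one should be applied to the query text, otherwise the query terms will not line up with the indexed terms.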

General Books
Introduction to Information Retrieval
- Not specific to Lucene, but about IR concepts
- Free e-book: http://nlp.stanford.edu/ir-book/
- P. Nayak and P. Raghavan: Introduction to Information Retrieval

Books/Papers
- S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
- M. McCandless, E. Hatcher, and O. Gospodnetic: Lucene in Action, 2nd Ed. http://www.manning.com/hatcher3/

Web Resources
- Official website: http://lucene.apache.org/
- StackOverflow: http://stackoverflow.com/questions/tagged/lucene
- Mailing lists: http://lucene.apache.org/core/discussion.html
- Blogs:
  - http://www.lucidimagination.com/blog/
  - http://blog.mikemccandless.com/
  - http://lucene.grantingersoll.com/

Getting Started
- Download lucene-6.4.0.zip (or .tgz)
- Add to your Eclipse project:
  - lucene-core-6.4.0.jar
  - lucene-queries-6.4.0.jar
  - lucene-queryparser-6.4.0.jar
- Luke (Lucene Index Toolbox): http://code.google.com/p/luke/

Advanced Material
Not required or lectured, but provided as backup material (possibly used later in the course)

Query-time Analysis
- Text in a query is analyzed like fields
- Use the same analyzer that analyzed the particular field

Query:          +field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
Analyzed text:  quick brown fox | lazy dog | cozy cat

Query Formation
- Query parsing
  - A query parser in core code
  - Additional query parsers in contributed code
- Or build the query from the Lucene query classes

Term Query
Matches documents with a particular term
Parameters:
- Field
- Text
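Conceptually, a term query is a single lookup in an inverted index: the term maps straight to the set of documents that contain it. The following is a minimal plain-Java sketch of that idea for one field, not Lucene's implementation. TermLookupSketch and its methods are hypothetical names, and tokenization is a simple lowercase split on non-word characters.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical class: a tiny inverted index for a single field.
final class TermLookupSketch {
    // term -> sorted IDs of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text and records the document under each of its terms.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A term query is one map lookup; unknown terms match no documents.
    SortedSet<Integer> termQuery(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TermLookupSketch index = new TermLookupSketch();
        index.add(0, "it is what it is");
        index.add(1, "what is it");
        index.add(2, "it is a banana");
        System.out.println(index.termQuery("what"));
        System.out.println(index.termQuery("banana"));
    }
}
```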

Term Range Query
Matches documents with any of the terms in a particular range
Parameters:
- Field
- Lowest term text
- Highest term text
- Include lowest term text?
- Include highest term text?
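The four range parameters map naturally onto a sorted term dictionary. As an illustration only (TermRangeSketch is a hypothetical name and this is not how Lucene stores its terms), the set of terms a range query expands to can be read off a NavigableSet:

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical class: expands a term range against a sorted term dictionary.
final class TermRangeSketch {

    // Returns the dictionary terms between lowest and highest, with each
    // bound included or excluded as requested.
    static NavigableSet<String> termsInRange(NavigableSet<String> dictionary,
            String lowest, boolean includeLowest,
            String highest, boolean includeHighest) {
        return dictionary.subSet(lowest, includeLowest, highest, includeHighest);
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary =
                new TreeSet<>(Arrays.asList("apple", "banana", "cherry", "date", "fig"));
        // Include the lowest term, exclude the highest one
        System.out.println(termsInRange(dictionary, "banana", true, "date", false));
    }
}
```

A document matches the range query if it contains any one of the returned terms.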

Prefix Query
Matches documents with any of the terms with a particular prefix
Parameters:
- Field
- Prefix
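In a sorted term dictionary, all terms sharing a prefix sit next to each other, so a prefix query can seek to the prefix and scan forward until the prefix stops matching. A small sketch of that scan, assuming a NavigableSet stands in for the term dictionary (PrefixSketch is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical class: expands a prefix against a sorted term dictionary.
final class PrefixSketch {

    // Starts at the first term >= prefix and stops at the first term that
    // no longer starts with it; every term in between shares the prefix.
    static List<String> termsWithPrefix(NavigableSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) break;
            matches.add(term);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableSet<String> dictionary =
                new TreeSet<>(Arrays.asList("lu", "lucene", "lucid", "luke", "solr"));
        System.out.println(termsWithPrefix(dictionary, "luc"));
    }
}
```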

Wildcard/Regex Query
Matches documents with any of the terms that match a particular pattern
Parameters:
- Field
- Pattern
  - Wildcard: * for 0+ characters, ? for exactly one character
  - Regular expression
Pattern matching is on individual terms only
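One way to see that wildcard patterns are a restricted form of regular expression is to translate one into the other. The sketch below does this with java.util.regex, treating * as any run of characters (including none) and ? as exactly one character, which is how Lucene's WildcardQuery documents them. WildcardSketch is a hypothetical name, and Lucene itself compiles wildcards to automata rather than to java.util.regex patterns.

```java
import java.util.regex.Pattern;

// Hypothetical class: matches a single term against a wildcard pattern.
final class WildcardSketch {

    // Translates the wildcard character by character; everything except
    // * and ? is quoted so it matches literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The pattern has to cover the whole term, not just part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("luc*", "lucene"));
        System.out.println(matches("te?t", "test"));
        System.out.println(matches("te?t", "tet"));
    }
}
```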

Fuzzy Query
Matches documents with any of the terms that are similar to a particular term
- Levenshtein distance ("edit distance"): number of character insertions, deletions or substitutions needed to transform one string into another
- e.g. kitten -> sitten -> sittin -> sitting (3 edits)
Parameters:
- Field
- Text
- Minimum similarity score
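The edit distance used for fuzzy matching can be computed with the classic dynamic-programming recurrence. This is a textbook sketch rather than Lucene's own implementation (which relies on Levenshtein automata); LevenshteinSketch is a hypothetical name.

```java
// Hypothetical class: textbook Levenshtein distance with two rolling rows.
final class LevenshteinSketch {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                       // build b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                       // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1,       // insertion
                                 prev[j] + 1),          // deletion
                        prev[j - 1] + substitution);    // substitution or match
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```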

Phrase Query
Matches documents in which all the given words are present and near each other
Parameters:
- Field
- Terms
- Slop: number of moves of words permitted; slop = 0 means an exact phrase match is required

Boolean Query
- Conceptually similar to boolean operators ("AND", "OR", "NOT"), but not identical
- Why Not AND, OR, And NOT? http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
- In short, boolean operators do not handle more than 2 clauses well

Boolean Query
Three types of clauses:
- Must
- Should
- Must not
For a boolean query to match a document:
- All "must" clauses must match
- All "must not" clauses must not match
- At least one "must" or "should" clause must match
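The three matching rules can be written down directly. The sketch below applies them to a document represented as a set of terms, with each clause reduced to a single term. It is an illustration of the rules only: BooleanClauseSketch is a hypothetical name and scoring is ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class: evaluates must / should / must-not term clauses.
final class BooleanClauseSketch {

    static boolean matches(Set<String> docTerms,
            List<String> must, List<String> should, List<String> mustNot) {
        for (String term : mustNot) {
            if (docTerms.contains(term)) return false;   // a "must not" clause matched
        }
        for (String term : must) {
            if (!docTerms.contains(term)) return false;  // a "must" clause failed
        }
        if (!must.isEmpty()) return true;                // some "must" clause matched
        for (String term : should) {
            if (docTerms.contains(term)) return true;    // some "should" clause matched
        }
        return false;                                    // nothing positive matched
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("quick", "brown", "fox"));
        List<String> none = Collections.emptyList();
        System.out.println(matches(doc, Arrays.asList("quick"), none, Arrays.asList("dog")));
        System.out.println(matches(doc, none, none, Arrays.asList("dog")));
    }
}
```

The last rule is why a query made only of "must not" clauses matches nothing: there is no positive clause to select documents in the first place.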

Filtering
A Filter narrows down the search result
- Creates a set of document IDs
- Decides what documents get processed further
- Does not affect scoring, i.e. does not score/rank documents that pass the filter
- Can be cached easily
- Useful for access control, presets, etc.
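A filter can be pictured as a bit set of allowed document IDs that is intersected with the hits: documents either pass or they do not, and the scores of those that pass are left untouched. A minimal plain-Java sketch of that behaviour follows; FilterSketch is a hypothetical name, and the map of document ID to score stands in for a real result list.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class: applies a document-ID filter to scored hits.
final class FilterSketch {

    // Keeps only the hits whose document ID is set in allowedDocs.
    // Scores are copied as they are; the filter never changes them.
    static Map<Integer, Float> applyFilter(Map<Integer, Float> scoredHits, BitSet allowedDocs) {
        Map<Integer, Float> kept = new LinkedHashMap<>();
        for (Map.Entry<Integer, Float> hit : scoredHits.entrySet()) {
            if (allowedDocs.get(hit.getKey())) {
                kept.put(hit.getKey(), hit.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Float> hits = new LinkedHashMap<>();
        hits.put(0, 1.5f);
        hits.put(1, 0.7f);
        hits.put(2, 0.2f);

        BitSet allowed = new BitSet();   // e.g. documents the user may see
        allowed.set(0);
        allowed.set(2);

        System.out.println(applyFilter(hits, allowed));
    }
}
```

Because the bit set does not depend on the query, it can be computed once and reused, which is what makes filters easy to cache.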

Notable Filter classes
- TermsFilter: allows documents with any of the given terms
- TermRangeFilter: filter version of TermRangeQuery
- PrefixFilter: filter version of PrefixQuery
- QueryWrapperFilter: adapts a query into a filter
- CachingWrapperFilter: caches the result of the wrapped filter

Sorting
- Score (default)
- Index order
- Field
  - Requires the field be indexed and not analyzed
  - Specify type (string, int, etc.)
- Normal or reverse order
- Single or multiple fields
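The sort options correspond closely to comparators over the hits. The sketch below orders a small hit list by descending score (the default) and by a string field in normal or reverse order, using plain java.util.Comparator. SortSketch, Hit and the title field are hypothetical names, and Lucene's own Sort and SortField classes are not used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical classes: sorting hits by score or by a field value.
final class SortSketch {

    static final class Hit {
        final int docId;      // index order
        final float score;
        final String title;   // an un-analyzed string field

        Hit(int docId, float score, String title) {
            this.docId = docId;
            this.score = score;
            this.title = title;
        }
    }

    // Default ordering: best score first, ties broken by index order.
    static List<Hit> byScore(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed()
                .thenComparingInt(h -> h.docId));
        return sorted;
    }

    // Field ordering on the string field, in normal or reverse order.
    static List<Hit> byTitle(List<Hit> hits, boolean reverse) {
        Comparator<Hit> byTitle = Comparator.comparing((Hit h) -> h.title);
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(reverse ? byTitle.reversed() : byTitle);
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(0, 0.2f, "b"), new Hit(1, 0.9f, "c"), new Hit(2, 0.5f, "a"));
        for (Hit h : byScore(hits)) System.out.println(h.docId + " " + h.score);
        for (Hit h : byTitle(hits, false)) System.out.println(h.docId + " " + h.title);
    }
}
```

Sorting on several fields amounts to chaining further comparators with thenComparing.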

ADVANCED MATERIAL: NOT LECTURED

Span Query
- Similar to other queries, but matches spans
- Span: a particular place/part of a particular document
- <document ID, start position, end position> tuple

Example (token positions start at 0):
T0 = "it is what it is"   positions 0 1 2 3 4
T1 = "what is it"         positions 0 1 2
T2 = "it is a banana"     positions 0 1 2 3

Spans for "it is", as <doc ID, start pos., end pos.>:
<0, 0, 2>
<0, 3, 5>
<2, 0, 2>
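The span tuples in this example can be reproduced with a short scan over the tokens of each document. The end position is exclusive, i.e. one past the last matching token, which is why a two-word match starting at position 0 ends at 2. This is a plain-Java sketch with whitespace tokenization, not Lucene's span machinery, and SpanSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical class: finds <doc ID, start, end> spans of an exact phrase.
final class SpanSketch {

    // Returns one int[]{docId, start, end} per occurrence; end is exclusive.
    static List<int[]> phraseSpans(String[] docs, String phrase) {
        String[] query = phrase.split(" ");
        List<int[]> spans = new ArrayList<>();
        for (int docId = 0; docId < docs.length; docId++) {
            String[] tokens = docs[docId].split(" ");
            for (int start = 0; start + query.length <= tokens.length; start++) {
                boolean match = true;
                for (int k = 0; k < query.length; k++) {
                    if (!tokens[start + k].equals(query[k])) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    spans.add(new int[] {docId, start, start + query.length});
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] docs = {"it is what it is", "what is it", "it is a banana"};
        for (int[] span : phraseSpans(docs, "it is")) {
            System.out.println(Arrays.toString(span));
        }
        // prints [0, 0, 2], [0, 3, 5] and [2, 0, 2], one per line
    }
}
```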

Span Query
- SpanTermQuery: same as TermQuery, except you can build other span queries with it
- SpanOrQuery: matches spans that are matched by any of some span queries
- SpanNotQuery: matches spans that are matched by one span query but not the other span query

[Figure: spans matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]) and spanNot(apple, orange) over the tokens "apple orange apple orange"]

Span Query
SpanNearQuery: matches spans that are within a certain slop of each other
- Slop: max number of positions between spans
- Can specify whether order matters

Example text: "the quick brown fox"
1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)
2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false)
3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false)
4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)
5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)

Interfacing Lucene with the Outside
- Embedding directly
- Language bridge, e.g. PHP/Java Bridge
- Web service, e.g. Jetty + your own request handler
- Solr (perhaps later): Lucene + Jetty + lots of useful functionality