COMP 4601: Implementing Search using Lucene

Luke: Lucene index analyzer
WARNING: I HAVE NOT USED THIS

Scenario
Crawler -> crawl directory containing tokenized content -> Lucene -> Lucene index directory

Classes for Indexing
- FSDirectory
- StandardAnalyzer
- IndexWriterConfig
- IndexWriter
- Document
- Field

Example Context
- Files have been crawled and the important information stored in new files with a TXT extension.
- Only content of interest has been saved and will be used for indexing.
- Your MongoDB code will be different, but a document returned by findOne() is equivalent to a file in this example.
- The code which follows is a SKETCH only.

Basic Algorithm
1. Open an FSDirectory (the index).
2. For each resource (i.e., a MongoDB document):
   a. Create a Lucene document.
   b. For each field of the Mongo document, create a field in the Lucene document, deciding whether it should be searchable or not.
   c. Save the Lucene document.

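Indexing
INDEX_DIR = where I store the index

    import java.io.File;
    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.store.FSDirectory;

    // dir (Directory) and writer (IndexWriter) are fields of the enclosing class,
    // so that the finally block and the helper methods can see them.
    try {
        File docDir = new File(CRAWL_DIR);
        dir = FSDirectory.open(new File(INDEX_DIR).toPath());
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE); // replace any existing index
        writer = new IndexWriter(dir, iwc);
        indexDocuments(writer, docDir);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (writer != null)
                writer.close();
            if (dir != null)
                dir.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }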

Indexing

    private void indexDocuments(IndexWriter writer, File file) {
        if (!file.canRead())
            return;
        if (file.isDirectory()) {
            // Recurse into every entry of the directory
            String[] files = file.list();
            if (files != null) {
                for (String name : files)
                    indexDocuments(writer, new File(file, name));
            }
        } else {
            // try-with-resources closes the stream even if indexing fails
            try (FileInputStream fis = new FileInputStream(file)) {
                indexAFile(file, fis);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

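Indexing
I am assuming that files are named with a document ID. This code removes the file extension (e.g., .xml).

    // writer is the IndexWriter field opened when the index was created
    private void indexAFile(File file, FileInputStream fis) throws IOException {
        Document doc = new Document();
        // Path: indexed as a single token and stored
        doc.add(new StringField(PATH, file.getPath(), Field.Store.YES));
        try {
            // Strip the extension; what remains is the document ID
            int docId = Integer.parseInt(file.getName().replaceFirst("[.][^.]+$", ""));
            // Stored so that it can be read back from a search hit.
            // (IntField no longer exists in Lucene 6; a StoredField is used instead.)
            doc.add(new StoredField(DOC_ID, docId));
        } catch (NumberFormatException e) {
            // File name is not a numeric ID: index the file without one
        }
        doc.add(new StoredField(MODIFIED, file.lastModified()));
        // Contents: tokenized and indexed, but not stored
        doc.add(new TextField(CONTENTS, new BufferedReader(
                new InputStreamReader(fis, "UTF-8"))));
        writer.addDocument(doc);
    }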

Classes for Searching
- DirectoryReader
- FSDirectory
- IndexSearcher
- QueryParser
- Query (e.g., "my search")
- TopDocs
- ScoreDoc

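Classes for Searching
INDEX_DIR = where I store the index

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public ArrayList<COMP4601Document> query(String searchString) {
        try {
            IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(new File(INDEX_DIR).toPath()));
            searcher = new IndexSearcher(reader); // field, also used by getDocs()
            Analyzer analyzer = new StandardAnalyzer();
            QueryParser parser = new QueryParser(CONTENTS, analyzer);
            Query q = parser.parse(searchString);
            TopDocs results = searcher.search(q, 100); // 100 documents!
            ScoreDoc[] hits = results.scoreDocs;
            // Load the hits before closing the reader that the searcher depends on
            ArrayList<COMP4601Document> docs = getDocs(hits);
            reader.close();
            return docs;
        } catch (IOException | ParseException e) {
            e.printStackTrace();
        }
        return null;
    }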

Classes for Searching
This is a sketch. The class COMP4601Document is used here to differentiate it from the Lucene Document class.

    // searcher is the IndexSearcher field set in query(); its reader must still be open
    public ArrayList<COMP4601Document> getDocs(ScoreDoc[] hits) throws IOException {
        ArrayList<COMP4601Document> docs = new ArrayList<>();
        for (ScoreDoc hit : hits) {
            Document indexDoc = searcher.doc(hit.doc);
            String id = indexDoc.get(DOC_ID);
            if (id != null) {
                // find() is your own lookup (e.g., in MongoDB) by document ID
                COMP4601Document d = find(Integer.valueOf(id));
                if (d != null) {
                    d.setScore(hit.score); // Used in display to user
                    docs.add(d);
                }
            }
        }
        return docs;
    }

Analyzers
Input: @Andy52 went to school yesterday!

Analyzer             Tokens
StandardAnalyzer     [andy52] [went] [school] [yesterday]
StopAnalyzer         [andy] [went] [school] [yesterday]
SimpleAnalyzer       [andy] [went] [to] [school] [yesterday]
WhitespaceAnalyzer   [@Andy52] [went] [to] [school] [yesterday!]
KeywordAnalyzer      [@Andy52 went to school yesterday!]
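The differences between some of these analyzers can be approximated in plain Java. The sketch below does not use Lucene at all: the class TokenizerSketch and its methods are hypothetical names, and the stop-word set is passed in by the caller rather than taken from Lucene's defaults. It only imitates splitting on whitespace (WhitespaceAnalyzer), splitting on non-letters plus lowercasing (SimpleAnalyzer), the same with stop words removed (StopAnalyzer), and keeping the whole input as one token (KeywordAnalyzer).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of four analyzers; NOT the Lucene implementations.
final class TokenizerSketch {

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Like SimpleAnalyzer: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty())
                tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple tokens minus the given stop words.
    static List<String> stop(String text, Set<String> stopWords) {
        List<String> tokens = simple(text);
        tokens.removeIf(stopWords::contains);
        return tokens;
    }

    // Like KeywordAnalyzer: the whole input is a single token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    public static void main(String[] args) {
        String text = "@Andy52 went to school yesterday!";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text, Collections.singleton("to")));
        System.out.println(keyword(text));
    }
}
```

Whichever analyzer is chosen at index time should also be used at query time, otherwise query terms may never line up with indexed terms.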

General Books
- Introduction to Information Retrieval
  Not specific to Lucene, but about IR concepts
  Free e-book: http://nlp.stanford.edu/ir-book/
- P. Nayak and P. Raghavan: Introduction to Information Retrieval

Books/Papers
- S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine
- M. McCandless, E. Hatcher, and O. Gospodnetic: Lucene in Action, 2nd Ed.
  http://www.manning.com/hatcher3/

Web Resources
- Official website: http://lucene.apache.org/
- StackOverflow: http://stackoverflow.com/questions/tagged/lucene
- Mailing lists: http://lucene.apache.org/core/discussion.html
- Blogs:
  http://www.lucidimagination.com/blog/
  http://blog.mikemccandless.com/
  http://lucene.grantingersoll.com/

Getting Started
- Download lucene-6.4.0.zip (or .tgz)
- Add to your Eclipse project:
  lucene-core-6.4.0.jar
  lucene-queries-6.4.0.jar
  lucene-queryparser-6.4.0.jar
- Luke (Lucene Index Toolbox): http://code.google.com/p/luke/

Advanced Material
Not required or lectured, but provided as backup material (possibly used later in the course)

Query-time Analysis
- Text in a query is analyzed like fields
- Use the same analyzer that analyzed the particular field

Query:  +field1:"quick brown fox" +(field2:"lazy dog" field2:"cozy cat")
Analyzed text:  quick brown fox / lazy dog / cozy cat

Query Formation
- Query parsing
  A query parser in core code
  Additional query parsers in contributed code
- Or build the query from the Lucene query classes

Term Query
Matches documents with a particular term
- Field
- Text

Term Range Query
Matches documents with any of the terms in a particular range
- Field
- Lowest term text
- Highest term text
- Include lowest term text?
- Include highest term text?
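The four range parameters above can be read as a simple predicate over term text. The following is a minimal plain-Java sketch, not Lucene's TermRangeQuery: the class and method names are hypothetical, and it compares with String.compareTo, which is an assumption that agrees with Lucene's byte-wise term ordering only for simple (e.g., ASCII) text.

```java
// Plain-Java illustration of a term range test; NOT Lucene's TermRangeQuery.
final class TermRangeSketch {

    // True if term lies between lower and upper, honoring the two inclusion flags.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int cmpLower = term.compareTo(lower);
        int cmpUpper = term.compareTo(upper);
        boolean aboveLower = includeLower ? cmpLower >= 0 : cmpLower > 0;
        boolean belowUpper = includeUpper ? cmpUpper <= 0 : cmpUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // "banana" sorts between "apple" and "cherry"
        System.out.println(inRange("banana", "apple", "cherry", true, true));
        // The lower bound itself is excluded when includeLower is false
        System.out.println(inRange("apple", "apple", "cherry", false, true));
    }
}
```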

Prefix Query
Matches documents with any of the terms with a particular prefix
- Field
- Prefix

Wildcard/Regex Query
Matches documents with any of the terms that match a particular pattern
- Field
- Pattern
  Wildcard: * for 0+ characters, ? for exactly one character
  Regular expression
- Pattern matching on individual terms only
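One way to see how a wildcard pattern is applied to a single term is to translate it into a regular expression. The sketch below is plain Java with hypothetical names (WildcardSketch, toRegex, matches) and is not how Lucene evaluates a WildcardQuery internally. It assumes that * stands for zero or more characters and ? for exactly one character, which is how Lucene's WildcardQuery documents them.

```java
import java.util.regex.Pattern;

// Plain-Java illustration of wildcard matching on one term; NOT Lucene's WildcardQuery.
final class WildcardSketch {

    // Translate a wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*')
                regex.append(".*");                             // zero or more characters
            else if (c == '?')
                regex.append('.');                              // exactly one character
            else
                regex.append(Pattern.quote(String.valueOf(c))); // literal character
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // The whole term must match the pattern, not just a part of it.
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("luc*", "lucene"));
        System.out.println(matches("te?t", "test"));
        System.out.println(matches("te?t", "tet"));
    }
}
```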

Fuzzy Query
Matches documents with any of the terms that are similar to a particular term
- Levenshtein distance ("edit distance"): number of character insertions, deletions or substitutions needed to transform one string into another
  e.g. kitten -> sitten -> sittin -> sitting (3 edits)
- Field
- Text
- Minimum similarity score
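The edit distance in the example above can be computed with the standard dynamic-programming recurrence. This is a minimal plain-Java sketch with hypothetical names (EditDistanceSketch, levenshtein); Lucene's FuzzyQuery does not work this way internally, so treat it only as an illustration of the distance itself.

```java
// Plain-Java Levenshtein distance; NOT Lucene's FuzzyQuery implementation.
final class EditDistanceSketch {

    // Minimum number of single-character insertions, deletions or
    // substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++)
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```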

Phrase Query
Matches documents with all the given words present and near each other
- Field
- Terms
- Slop
  Number of moves of words permitted
  Slop = 0 means exact phrase match required

Boolean Query
- Conceptually similar to boolean operators ("AND", "OR", "NOT"), but not identical
- Why Not AND, OR, And NOT?
  http://www.lucidimagination.com/blog/2011/12/28/why-not-and-or-and-not/
  In short, boolean operators do not handle > 2 clauses well

Boolean Query
- Three types of clauses
  Must
  Should
  Must not
- For a boolean query to match a document
  All "must" clauses must match
  All "must not" clauses must not match
  At least one "must" or "should" clause must match
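The three matching rules above can be written down as a small predicate. In this plain-Java sketch a document is reduced to the set of terms it contains and each clause to a single term; both simplifications, and the names BooleanMatchSketch and matches, are assumptions made for illustration and are not Lucene's BooleanQuery API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration of must / should / must-not matching; NOT Lucene's BooleanQuery.
final class BooleanMatchSketch {

    static boolean matches(Set<String> docTerms, List<String> must,
                           List<String> should, List<String> mustNot) {
        // All "must" clauses must match
        for (String term : must) {
            if (!docTerms.contains(term))
                return false;
        }
        // All "must not" clauses must not match
        for (String term : mustNot) {
            if (docTerms.contains(term))
                return false;
        }
        // At least one "must" or "should" clause must match
        if (!must.isEmpty())
            return true; // every must clause matched above
        for (String term : should) {
            if (docTerms.contains(term))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("quick", "brown", "fox"));
        List<String> none = Collections.emptyList();
        System.out.println(matches(doc, Arrays.asList("quick"), none, Arrays.asList("dog")));
        System.out.println(matches(doc, none, Arrays.asList("cat", "fox"), none));
        System.out.println(matches(doc, none, none, Arrays.asList("dog")));
    }
}
```

Note the last call: a query made only of "must not" clauses matches nothing, because no "must" or "should" clause has matched.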

Filtering
- A Filter narrows down the search result
  Creates a set of document IDs
  Decides what documents get processed further
  Does not affect scoring, i.e. does not score/rank documents that pass the filter
- Can be cached easily
- Useful for access control, presets, etc.

Notable Filter classes
- TermsFilter: allows documents with any of the given terms
- TermRangeFilter: filter version of TermRangeQuery
- PrefixFilter: filter version of PrefixQuery
- QueryWrapperFilter: adapts a query into a filter
- CachingWrapperFilter: caches the result of the wrapped filter

Sorting
- Score (default)
- Index order
- Field
  Requires the field be indexed & not analyzed
  Specify type (string, int, etc.)
- Normal or reverse order
- Single or multiple fields

ADVANCED MATERIAL: NOT LECTURED

Span Query
- Similar to other queries, but matches spans
- Span: a particular place/part of a particular document, given as a <document ID, start position, end position> tuple

T0 = "it is what it is"    (positions 0 1 2 3 4)
T1 = "what is it"          (positions 0 1 2)
T2 = "it is a banana"      (positions 0 1 2 3)

Spans for "it is", as <doc ID, start pos., end pos.>:
<0, 0, 2>
<0, 3, 5>
<2, 0, 2>
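The spans listed above can be reproduced with a few lines of plain Java. In this sketch, positions are token indexes after splitting on single spaces and the end position is exclusive, which is what the tuples above imply; the class and method names are hypothetical and no Lucene span classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of <doc ID, start, end> spans; NOT Lucene's span queries.
final class SpanSketch {

    // Finds every occurrence of the phrase in every document.
    // Each result is {docId, startPosition, endPosition}, with the end exclusive.
    static List<int[]> phraseSpans(String[] docs, String phrase) {
        String[] wanted = phrase.split(" ");
        List<int[]> spans = new ArrayList<>();
        for (int docId = 0; docId < docs.length; docId++) {
            String[] tokens = docs[docId].split(" ");
            for (int start = 0; start + wanted.length <= tokens.length; start++) {
                String[] window = Arrays.copyOfRange(tokens, start, start + wanted.length);
                if (Arrays.equals(window, wanted))
                    spans.add(new int[] {docId, start, start + wanted.length});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] docs = {"it is what it is", "what is it", "it is a banana"};
        for (int[] span : phraseSpans(docs, "it is"))
            System.out.println(Arrays.toString(span));
    }
}
```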

Span Query
- SpanTermQuery: same as TermQuery, except you can build other span queries with it
- SpanOrQuery: matches spans that are matched by any of some span queries
- SpanNotQuery: matches spans that are matched by one span query but not the other span query

[Figure: spans matched by spanTerm(apple), spanTerm(orange), spanOr([apple, orange]) and spanNot(apple, orange) over the text "apple orange apple orange"]

Span Query
- SpanNearQuery: matches spans that are within a certain "slop" of each other
  Slop: max number of positions between spans
  Can specify whether order matters

Text: the quick brown fox    [figure annotations: 2 1 0]

1. spanNear([brown, fox, the, quick], slop = 4, inOrder = false)
2. spanNear([brown, fox, the, quick], slop = 3, inOrder = false)
3. spanNear([brown, fox, the, quick], slop = 2, inOrder = false)
4. spanNear([brown, fox, the, quick], slop = 3, inOrder = true)
5. spanNear([the, quick, brown, fox], slop = 3, inOrder = true)

Interfacing Lucene with "Outside"
- Embedding directly
- Language bridge
  E.g. PHP/Java Bridge
- Web service
  E.g. Jetty + your own request handler
- Solr (perhaps later)
  Lucene + Jetty + lots of useful functionality