Text Search With Lucene

Lorena Dean
6 years ago

Please refer to the Geode documentation for the final implementation.

- Requirements
- Related Documents
- Terminology
- API
  - User Input
  - Key points
  - Java API
    - Examples
  - Gfsh API
  - XML Configuration
  - REST API
  - Spring Data GemFire Support
- Implementation Flowchart
  - Inside LuceneIndex
  - A closer look at Partitioned region data flow
  - Processing Queries
- Implementation Details
  - Index Storage
  - Storage with different region types
  - Walkthrough creating index in Geode region
  - Handling failures, restarts, and rebalance
  - Aggregation
  - Result collection and paging
- JMX MBean

Requirements

- Allow users to create Lucene indexes on data stored in Geode.
- Update the indexes asynchronously to avoid impacting write latency.
- Allow users to perform text (Lucene) search on Geode data using the Lucene index. Results from text searches may be stale due to asynchronous index updates.
- Provide highly available indexes using Geode's HA capabilities.
- Scalability.
- Performance comparable to RAMFSDirectory.

Out of Scope

- Building the next/better Solr/Elasticsearch.
- Enhancing the current Geode OQL to use the Lucene index.

Related Documents

A previous integration of Lucene and GemFire:

Similar efforts done by other data products:

- Hibernate Search: Hibernate search
- Solandra: Solandra embeds Solr in Cassandra.

Terminology

- Documents: In Lucene, a Document is the unit of search and index. An index consists of one or more Documents.
- Fields: A Document consists of one or more Fields. A Field is simply a name-value pair.

Indexing involves adding Documents to an IndexWriter, and searching involves retrieving Documents from an index via an IndexSearcher.
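To make the terminology concrete, here is a toy model in plain Java. It is not Lucene's API: ToyIndex, addDocument, and search are hypothetical names, and tokenization is just a lowercase split on non-alphanumeric characters. It only illustrates that a document is a set of name-value fields, that indexing means adding documents, and that searching returns the documents containing a term in a given field.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index; NOT Lucene's IndexWriter/IndexSearcher.
final class ToyIndex {
    // Stored documents; a document's position in this list is its id.
    private final List<Map<String, String>> documents = new ArrayList<>();
    // Inverted index: "fieldName:term" -> ids of documents containing the term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Adds a document (a map of field name to field value) and returns its id. */
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new HashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            // Simplistic analysis: lowercase, then split on anything that is not a letter or digit.
            for (String term : field.getValue().toLowerCase().split("[^a-z0-9]+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    /** Returns the documents whose given field contains the given term. */
    List<Map<String, String>> search(String fieldName, String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                fieldName + ":" + term.toLowerCase(), Collections.<Integer>emptySet());
        for (int docId : ids) {
            hits.add(documents.get(docId));
        }
        return hits;
    }
}
```

In real Lucene the analysis step is pluggable (the Analyzer mentioned in the API section), which is why the proposal lets users choose an analyzer per field.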

API

User Input

- A region and a list of to-be-indexed fields
- [Optional] An Analyzer specified for each field; the Standard Analyzer is used for fields without one

Key points

- A single index will not support multiple regions. Join queries between regions are not supported.
- Heterogeneous objects in a single region will be supported.
- Only top-level fields of nested objects can be indexed, not nested collections.
- The index needs to be created before the region is created (for phase 1).
- Pagination of results will be supported.

Users will interact with a new LuceneService interface, which provides methods for creating indexes and querying. Users can also create indexes through gfsh or cache.xml.

Java API

Now that this feature has been implemented, please refer to the javadocs for details on the Java API.

Examples

    // Get LuceneService
    LuceneService luceneService = LuceneServiceProvider.get(cache);

    // Create index on fields with the default analyzer
    luceneService.createIndex(indexName, regionName, "field1", "field2", "field3");

    // Create index on fields with specified analyzers
    Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
    analyzerPerField.put("field1", new StandardAnalyzer());
    analyzerPerField.put("field2", new KeywordAnalyzer());
    luceneService.createIndex(indexName, regionName, analyzerPerField);

    // The region is created after the index
    Region region = cache.createRegionFactory(RegionShortcut.PARTITION).create(regionName);

    // Create query
    LuceneQuery query = luceneService.createLuceneQueryFactory()
        .setLimit(200)
        .setPageSize(20)
        .create(indexName, regionName, queryString, "field1" /* default field */);

    // Search using the query
    PageableLuceneQueryResults<K, Object> results = query.findPages();

    // Pagination
    while (results.hasNext()) {
      results.next().stream().forEach(struct -> {
        Object value = struct.getValue();
        System.out.println("Key is " + struct.getKey() + ", value is " + value);
      });
    }
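The limit and page size in the example interact as described in the "Result collection and paging" section: all matching keys are gathered at the query executor, and values are fetched one page at a time. The following is a minimal sketch of that page-cutting logic in plain Java, with no Geode dependency. KeyPager and its fetchValues function (standing in for a getAll call) are hypothetical names, not part of the Lucene integration API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical helper: cuts a ranked key list into pages and fetches values lazily.
final class KeyPager<K, V> implements Iterator<Map<K, V>> {
    private final List<K> keys;
    private final int pageSize;
    private final Function<List<K>, Map<K, V>> fetchValues; // stand-in for a bulk getAll
    private int position = 0;

    KeyPager(List<K> rankedKeys, int limit, int pageSize,
             Function<List<K>, Map<K, V>> fetchValues) {
        if (limit < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("limit must be >= 0 and pageSize > 0");
        }
        // Apply the overall result limit once, up front.
        this.keys = new ArrayList<>(rankedKeys.subList(0, Math.min(limit, rankedKeys.size())));
        this.pageSize = pageSize;
        this.fetchValues = fetchValues;
    }

    @Override
    public boolean hasNext() {
        return position < keys.size();
    }

    @Override
    public Map<K, V> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        // Values are fetched only for the keys of the page being returned.
        int end = Math.min(position + pageSize, keys.size());
        Map<K, V> page = fetchValues.apply(keys.subList(position, end));
        position = end;
        return page;
    }
}
```

With 45 matching keys, a limit of 200, and a page size of 20, this yields pages of 20, 20, and 5 entries; values for later pages are never fetched if the caller stops early.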

Gfsh API

    // List indexes
    gfsh> list lucene indexes [with-stats]

    // Create index
    gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags

    // Create index with one analyzer per field
    gfsh> create lucene index --name=indexName --region=/orders --field=customer,tags --analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.analysis.bg.BulgarianAnalyzer

    // Execute Lucene query
    gfsh> search lucene --regionName=/orders --queryStrings="John*" --defaultField=field1 --limit=100

XML Configuration

    <cache xmlns="..."
           xmlns:lucene="..."
           xmlns:xsi="..."
           xsi:schemaLocation="..."
           version="1.0">
      <region name="region" refid="PARTITION">
        <lucene:index name="index">
          <lucene:field name="a" analyzer="org.apache.lucene.analysis.core.KeywordAnalyzer"/>
          <lucene:field name="b" analyzer="org.apache.lucene.analysis.core.SimpleAnalyzer"/>
          <lucene:field name="c" analyzer="org.apache.lucene.analysis.standard.ClassicAnalyzer"/>
        </lucene:index>
      </region>
    </cache>

REST API

TBD - But using Solr to provide a REST API might make a lot of sense.

Spring Data GemFire Support

TBD - But the Searchable annotation described in this blog might be a good place to start.

Implementation Flowchart

Inside LuceneIndex

A closer look at Partitioned region data flow

Processing Queries

Implementation Details

Index Storage

The Lucene indexes will be stored in memory instead of on disk. This will be done by implementing a Lucene Directory called RegionDirectory, which uses Geode as a flat file system. This way we get all the benefits offered by Geode and can achieve replication and sharding of the indexes. The Lucene indexes will be co-located with the data region in case of HA.

A LuceneIndex object will be created for each index to manage all the attributes related to the index, such as reflection fields, AEQ listener, RegionDirectory array, Search, etc.

If the user's data region is a partitioned region, there will be one LuceneIndex for the partitioned region. Every bucket in the data region will have its own RegionDirectory (which implements Lucene's Directory interface), which keeps the FileSystem for the index regions. The index regions consist of 2 regions:

- FileRegion: holds the metadata about the index files.
- ChunkRegion: holds the actual data chunks for a given index file.

The FileRegion and ChunkRegion will be collocated with the data region that is to be indexed. The FileRegion and ChunkRegion will have a partition resolver that looks at the bucket id part of the key only.

An AsyncEventQueue will be used to update the LuceneIndex. The AsyncEventListener will process the events in the AEQ in batches. When a data entry is processed:

1. Create a document for the indexed fields. Indexed field values are obtained from the AsyncEvent through reflection (in case of a domain object) or through the PdxInstance interface (in case of PDX or JSON); a Lucene document object is constructed and added to the LuceneIndex associated with that region.
2. Determine the bucket id of the entry.
3. Get the RegionDirectory for that bucket and save the document into the RegionDirectory.

Storage with different region types

- PersistentRegions: The Lucene index will be persisted.
- OverflowRegions: The Lucene index will not be overflowed. The rationale here is that the Lucene index will be much smaller than the data, so it is not necessary to overflow the index.
- EmptyRegions: The Lucene index is not supported.
- OffHeapRegions: The Lucene index will be stored off-heap.

Walkthrough creating index in Geode region

1) Create a LuceneIndex object to hold the data structures that will be created in the following steps. This object will later be registered with the cache-owned LuceneService.
2) The LuceneIndex will keep all the reflective fields.
3) Assume the dataregion is a PartitionedRegion (otherwise, there is no need to define a PartitionResolver). Create a FileRegion (let's call it "fr") and a ChunkRegion (let's call it "cr"), collocated with the data region (let's name it "dataregion"). Define a PartitionResolver that uses the dataregion's bucket id as the routing object, which guarantees that an index bucket region has the same bucket id as the corresponding dataregion bucket region, even when the dataregion has its own customer-defined PartitionResolver. We don't need to define a PartitionResolver on the dataregion.
4) The FileRegion and ChunkRegion use the same region attributes as the dataregion. In the partitioned region case, the FileRegion and ChunkRegion will be under the same parent region, i.e. /root in this example. In the replicated region case, the index regions will always be root regions.
5) Create a RegionDirectory object for a bucket using the same bucket of the FileRegion and ChunkRegion.
6) Create a PerFieldAnalyzerWrapper and save the fields in the LuceneIndex.
7) Create a Lucene IndexWriterConfig object using the Analyzer.
8) Create a Lucene IndexWriter object using the GeodeDirectory and the IndexWriterConfig object.
9) Define an AEQ with multiple dispatcher threads and order-policy=partition. That will group events by bucket id into different dispatcher queues. Each dispatcher thread will call our AEQ listener to process events for one or more buckets. Each event will be converted into a document and written into the ChunkRegion via the RegionDirectory. We don't need a lock for the RegionDirectory, since only one thread will process a given bucket's events.
10) If the dataregion is a replicated region, define the AEQ with a single dispatcher thread.
11) Register the newly created LuceneIndex with the LuceneService. The registration step will also publish the metadata into the "lucene_meta_region", which is a persistent replicated region, so that other JVMs will know that a new LuceneIndex with this metadata was created. All members should have a LuceneService instance with the same LuceneIndex definition.

Index Maintenance

A LuceneIndex can be created and destroyed. We don't support creating an index on a region that already contains data for now.

Handling failures, restarts, and rebalance

The index region and async event queue will be restored together with the colocated data region's buckets. So during failover the new primary should be able to read and write the index as usual.

Aggregation

In the case of partitioned regions, the query must be sent out to all the primaries. The results will then need to be aggregated back together. Lucene search will use the FunctionService to distribute the query to the primaries.

Input to primaries:

- Serialized Query
- CollectorManager to be used for local aggregation
- Result limit

Output from primaries:

1. Merged collector created from the results of the search on local bucket indexes.
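One possible shape for the final aggregation step, sketched in plain Java: each primary returns its locally merged hits, and the coordinating member merges them, keeping only the best `limit` hits by score. Hit and TopHitsMerger are hypothetical names for illustration; the design above delegates local aggregation to a Lucene CollectorManager, and this sketch does not use that API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical result entry: a region key plus the Lucene relevance score.
final class Hit {
    final String key;
    final float score;

    Hit(String key, float score) {
        this.key = key;
        this.score = score;
    }
}

// Hypothetical coordinator-side merge of the per-primary results.
final class TopHitsMerger {
    /** Returns at most {@code limit} hits across all members, highest score first. */
    static List<Hit> merge(List<List<Hit>> perMemberHits, int limit) {
        // Min-heap on score: the weakest retained hit is at the head and is evicted first.
        PriorityQueue<Hit> best =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (List<Hit> hits : perMemberHits) {
            for (Hit hit : hits) {
                best.add(hit);
                if (best.size() > limit) {
                    best.poll(); // drop the current lowest-scoring hit
                }
            }
        }
        List<Hit> merged = new ArrayList<>(best);
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }
}
```

Because the heap never holds more than `limit + 1` entries, the coordinator's memory use depends on the result limit rather than on the number of primaries, which is why the limit is also sent to each primary.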

We are still investigating options for how to aggregate the data; see Text Search Aggregation Options.

In the case of replicated regions, the query will be sent to one of the members and executed there. Aggregation will be handled in that member before the results are returned to the caller.

Result collection and paging

The ResultSet will support a pagination mechanism to retrieve the results. All the keys are aggregated at the query executor node (client or peer), and getAll is used to fetch the values according to the page size.

JMX MBean

A Lucene Service MBean is available and is accessed through an ObjectName like:

    GemFire:service=CacheService,name=LuceneService,type=Member,member= (59583)<ec><v5>-1026

This MBean provides these operations:

LuceneServiceMBean API

    /**
     * Returns an array of {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
     * instances defined in this member
     *
     * @return an array of LuceneIndexMetrics for the LuceneIndexes defined in this member
     */
    public LuceneIndexMetrics[] listIndexMetrics();

    /**
     * Returns an array of {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
     * instances defined on the input region in this member
     *
     * @param regionPath The full path of the region to retrieve
     * @return an array of LuceneIndexMetrics for the LuceneIndex instances defined on the input region
     *         in this member
     */
    public LuceneIndexMetrics[] listIndexMetrics(String regionPath);

    /**
     * Returns a {@link LuceneIndexMetrics} for the {@link com.gemstone.gemfire.cache.lucene.LuceneIndex}
     * with the input index name defined on the input region in this member.
     *
     * @param regionPath The full path of the region to retrieve
     * @param indexName The name of the index to retrieve
     * @return a LuceneIndexMetrics for the LuceneIndex with the input index name defined on the input region
     *         in this member.
     */
    public LuceneIndexMetrics listIndexMetrics(String regionPath, String indexName);

A LuceneIndexMetrics data bean includes raw stat values like:

LuceneIndexMetrics Sample

    Region=/data2; index=full_index
    commitTime->
    commits->5999
    commitsInProgress->0
    documents->498
    queryExecutionTime->0
    queryExecutionTotalHits->0
    queryExecutions->0
    queryExecutionsInProgress->0
    updateTime->
    updates->6419
    updatesInProgress->0

Limitations include:

- no rates or average latencies are available
- no aggregation (which means no rollups across members in the GemFire -> Distributed MBean)
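Because the bean exposes only raw counters, a monitoring client that wants rates or average latencies would have to derive them itself, for example by sampling the counters twice and differencing. A minimal sketch in plain Java follows. StatSample and DerivedStats are hypothetical helpers, not part of the MBean API, and the unit of commitTime (nanoseconds here) is an assumption.

```java
// Hypothetical snapshot of two raw counters read from a LuceneIndexMetrics bean.
final class StatSample {
    final long timestampMillis; // wall-clock time when the sample was taken
    final long commits;         // cumulative number of commits
    final long commitTime;      // cumulative time spent committing (ASSUMED to be nanoseconds)

    StatSample(long timestampMillis, long commits, long commitTime) {
        this.timestampMillis = timestampMillis;
        this.commits = commits;
        this.commitTime = commitTime;
    }
}

// Hypothetical client-side derivation of the values the MBean does not provide.
final class DerivedStats {
    /** Commits per second between two samples; 0 if no time has elapsed. */
    static double commitsPerSecond(StatSample earlier, StatSample later) {
        long elapsedMillis = later.timestampMillis - earlier.timestampMillis;
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return (later.commits - earlier.commits) * 1000.0 / elapsedMillis;
    }

    /** Average time per commit between two samples; 0 if no commit happened. */
    static double averageCommitTime(StatSample earlier, StatSample later) {
        long commitDelta = later.commits - earlier.commits;
        if (commitDelta <= 0) {
            return 0.0;
        }
        return (double) (later.commitTime - earlier.commitTime) / commitDelta;
    }
}
```

The same differencing applies to the update and query execution counters. Rolling values up across members would additionally require summing the deltas from each member's bean, since no distributed aggregate is provided.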
