Apache Lucene Eurocon: Preview

1 Apache Lucene Eurocon: Preview

2 Overview
- Introduction
- Near Real Time Search: Yonik Seeley
- Munching & Crunching: Andrzej Białecki
- Solr in the Cloud: Mark Miller
- Practical Relevance: Grant Ingersoll
- Q&A
A link to download these slides will be available after the webcast is complete. An on-demand replay will be ready in ~48 hours.

3 Near Real Time Search Yonik Seeley

4 Near Real-Time Search
- Shorter times until updates are searchable/visible
- Lucene 2.9 first laid the groundwork with per-segment searching
  - Per-segment FieldCache entries for sorting and FunctionQueries
- NRT IndexWriter.getReader()
  - Makes new segments available before merging is done in the background
  - Doesn't cause a commit/fsync first
- Solr still needs
  - Per-segment faceting
  - Per-segment caching
  - Per-segment statistics
  - (and anything else that uses FieldCache)

5 Existing single-valued faceting algorithm
Request: q=juggernaut&facet=true&facet.field=hero
- Input: the documents matching the base query "Juggernaut"
- Lucene FieldCache entry (StringIndex) for the hero field
  - order: for each doc, an index into the lookup array
  - lookup: the string values: (null), batman, flash, spiderman, superman, wolverine
- For each matching doc: look up its order value and increment the corresponding accumulator slot
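
The order/lookup/accumulator steps on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Solr's code: the function name and the sample order array are assumptions, and only the lookup values come from the slide.

```python
def facet_counts(order, lookup, matching_docs):
    """Count facet values for the documents matching the base query."""
    # One accumulator slot per entry in the lookup array.
    accumulator = [0] * len(lookup)
    for doc in matching_docs:
        # order[doc] is the index of this doc's value in lookup.
        accumulator[order[doc]] += 1
    # Report only non-zero counts; slot 0 stands for docs with no value.
    return {lookup[i]: n for i, n in enumerate(accumulator)
            if n and lookup[i] is not None}

# lookup values as listed on the slide; index 0 is the (null) entry.
lookup = [None, "batman", "flash", "spiderman", "superman", "wolverine"]
# Hypothetical per-document ordinals for an 8-document index.
order = [5, 1, 0, 2, 2, 5, 4, 1]
# Hypothetical ids of the documents matching q=juggernaut.
matching_docs = [0, 3, 4, 5]

counts = facet_counts(order, lookup, matching_docs)
```

Because the accumulator is indexed by ordinal, string comparisons are never needed while counting; the strings are touched only when the result is reported.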

6 Per-segment single-valued faceting algorithm
[Diagram: the base DocSet is processed segment by segment, one thread per segment (thread1, thread2, thread3, ...). Each thread uses its own segment's FieldCache entry (Segment1 to Segment4) to look up and increment its own accumulator (accumulator1 to accumulator4). A FieldCache + accumulator merger (priority queue) then combines the per-segment results, e.g. flash, 5; Batman, 3.]
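
A rough Python sketch of the per-segment variant follows. It assumes each segment keeps its own sorted lookup array and segment-local document ids, and it uses heapq.merge as the priority-queue merger; the data layout, names, and sample segments are illustrative assumptions, not Solr's implementation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def count_segment(segment):
    """Accumulate counts for one segment; returns (term, count) in term order."""
    lookup, order, matching_docs = segment
    accumulator = [0] * len(lookup)
    for doc in matching_docs:
        accumulator[order[doc]] += 1
    # lookup is sorted after the leading None, so the output is sorted by term.
    return [(lookup[i], n) for i, n in enumerate(accumulator)
            if n and lookup[i] is not None]

def merge_counts(per_segment):
    """Merge sorted per-segment lists, summing the counts of equal terms."""
    merged = []
    for term, n in heapq.merge(*per_segment):
        if merged and merged[-1][0] == term:
            merged[-1] = (term, merged[-1][1] + n)
        else:
            merged.append((term, n))
    return merged

def facet_per_segment(segments, threads=4):
    """Count each segment on its own worker thread, then merge the results."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_segment = list(pool.map(count_segment, segments))
    return merge_counts(per_segment)

# Hypothetical segments: (lookup, order, ids of matching docs in the segment).
segments = [
    ([None, "batman", "flash"], [1, 2, 2, 0], [0, 1, 2]),
    ([None, "flash", "superman"], [1, 1, 2, 1], [0, 1, 3]),
    ([None, "batman"], [1, 1, 0], [0, 1]),
]
totals = facet_per_segment(segments)
```

The merge step is the extra cost the next slide mentions: ordinals are only meaningful inside one segment, so counts have to be matched up by term value across segments.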

7 Per-segment faceting
- Enable with facet.method=fcs
- Controllable multi-threading: facet.field={!threads=4}myfield
- Disadvantages
  - Larger memory use (FieldCaches + accumulators)
  - Slower (extra FieldCache merge step needed)
- Advantages
  - Rebuilds FieldCache entries only for new segments (NRT friendly)
  - Multi-threaded
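
As a small illustration, the request parameters named on this slide can be assembled into a URL as below. The endpoint (host, port, path) and the helper name are assumptions for the example; only facet.method=fcs and the {!threads=4} local parameter come from the slide.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port, and path are assumptions for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def facet_request_url(query, field, threads=None, base_url=SOLR_SELECT_URL):
    """Build a request URL that asks for per-segment faceting on one field."""
    # The {!threads=N} local parameter is prepended to the field name.
    local_params = "{!threads=%d}" % threads if threads else ""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.method", "fcs"),
        ("facet.field", local_params + field),
    ]
    # urlencode percent-escapes the braces, "!" and "=" in the local parameter.
    return base_url + "?" + urlencode(params)

url = facet_request_url("juggernaut", "myfield", threads=4)
```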

8 Per-segment faceting performance comparison
Test index: 10M documents, 18 segments, single-valued field

A. Base DocSet = 100 docs, facet.field on a field with 100,000 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              3 ms               244 ms
quickly changing index    1388 ms            267 ms

B. Base DocSet = 1,000,000 docs, facet.field on a field with 100 unique terms

Time for request*         facet.method=fc    facet.method=fcs
static index              26 ms              34 ms
quickly changing index    741 ms             94 ms

*Complete request time, measured externally

9 Munching & Crunching: Lucene index post-processing and applications
Andrzej Białecki

10 Munching & Crunching: Agenda
- Post-processing
  - Splitting, merging, sorting, pruning
- Tiered search
- Bitwise search
- Map-reduce indexing models

11 Post-processing
Isn't it better to build it right from the start?
- Some parameters are difficult to get right...
  - Minimizing index size while retaining search quality
  - Correcting impact of unexpected common words
  - Creating evenly-sized shards
- ...perhaps impossible to get at all during indexing
  - Adding collection-wide factors not computed by Lucene (e.g. avg. length)
  - Optimizing top-n results for common queries
  - Fitting too large indexes in RAM

12 Merging, splitting, sorting, pruning
- Splitting: IndexSplitter, MultiPassIndexSplitter, TheTrueSplitter
- Sorting postings by impact and early termination search
- Index pruning: What data to remove and how?
  - Pruning strategies
  - Challenges

13 Tiered search
Assuming we CAN prune effectively, while maintaining good search quality...
[Diagram: one search box holding three index tiers: RAM (70% pruned), SSD (30% pruned?), HDD (0% pruned)]

14 Tiered search
Assuming we CAN prune effectively, while maintaining good search quality...
[Diagram: the tiers spread over three machines: search box 1 with RAM (70% pruned), search box 2 with SSD (30% pruned?), search box 3 with HDD (0% pruned)]

15 Bit-wise search
- Given a bit pattern query: find best matching bit patterns in documents
- Applications
  - Fuzzy fingerprinting
  - De-duplication
  - Plagiarism detection
- BitwiseSearcher and Solr BitwiseField design
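
One way to picture "best matching bit patterns" is ranking documents by Hamming distance to the query pattern. The sketch below assumes that measure and uses made-up 8-bit fingerprints; it is not the BitwiseSearcher or BitwiseField design, which the slide names but does not describe.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer patterns differ."""
    return bin(a ^ b).count("1")

def best_matches(query_bits, doc_bits, top_n=3):
    """Return up to top_n (doc_id, distance) pairs, closest pattern first."""
    scored = sorted((hamming_distance(query_bits, bits), doc_id)
                    for doc_id, bits in doc_bits.items())
    return [(doc_id, dist) for dist, doc_id in scored[:top_n]]

# Hypothetical 8-bit document fingerprints.
fingerprints = {
    "a": 0b10110010,
    "b": 0b10110011,
    "c": 0b01001101,
    "d": 0b10100010,
}
matches = best_matches(0b10110010, fingerprints)
```

A distance of 0 is an exact duplicate, while small non-zero distances suggest near-duplicates, which is the property that fingerprint-based de-duplication relies on.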

16 Massive indexing
- Map-reduce indexing models
  - Google model
  - Nutch model
  - Modified Nutch model
  - Hadoop contrib/indexing model
- Tradeoff analysis and recommendations

17 Solr in the Cloud
Mark Miller

19 Some of the Complications?
- Dealing with config files
- Setting up high availability
- Status of cluster
- Reshaping/Rebalancing cluster

20 Improvements: High Level Goals
Improve...
- Shared/Central Config
- High Availability and Fault Tolerance
- Cluster Resizing/Rebalancing
- Open/Standard ZK schema
- Cluster status

21 Enter Solr Cloud and ZooKeeper
- ZooKeeper is basically a highly available distributed filesystem
- Config and cluster state live in ZooKeeper
- Solr is alerted to changes in cluster state by ZK
- Solr gets a built-in load balancing impl that can read cluster state from ZK
- Clients don't need to know about shards - or can choose logical shards

22 What's Been Done So Far
- A lot of base work - ZooKeeper Mode
- Shared/Central config
- Built-in search side fault tolerance
- Very simple cluster status

23 The Future?
- Index side fault tolerance
- Cluster resizing/rebalancing/elasticity
- More Solr/ZK tools?
- Lots of other little fun improvements

24 Practical Relevance
Grant Ingersoll
2010 Prague, Czech Republic

25 Why Tune Relevance?
- Better search results = Less time searching, more time acting
- Less time searching = Happier, more effective users
- Happier, more effective users = $,,, Kč (earned/saved)
- $,,, Kč (earned/saved) = Big fat raise for you!

26 Testing Relevance
- A/B testing
- Log Analysis
- Empirical
  - Top 50 queries, plus random sample
- Ask
  - Ratings/Reviews
  - Focus Groups
- Also: Ad Hoc, TREC, etc.

27 Understand your Domain
- Types of documents
  - Languages present
  - Document structures, metadata and other features
  - Lexical resources: jargon, synonyms, abbreviations...
  - Relationships between documents
- Users
  - Sophistication/Expertise
  - Search and Discovery needs
  - Known Item vs. Keyword
  - Tolerance for Pain
- Managers
  - Business Interests
  - Release cycles
- Obsession in finding the one true relevance model (hint, it doesn't exist)
- explain() blindness

28 Phrases
- Almost always a win to automatically add phrase query variations to all multiword queries
- Even better to detect key phrases
- In Solr, with the Dismax handler, use the &pf and &ps options to automatically add phrase boosts
- Using a large slop factor can simulate an AND query while rewarding close proximity
- See also the ComplexPhraseQuery in contrib/queryparser
- Consider SpanQuery and derivatives
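
The pf and ps options mentioned above might be combined into a Dismax request roughly as follows. The field names, boosts, slop value, and helper name are hypothetical; defType, q, qf, pf, and ps are standard Solr Dismax request parameters.

```python
from urllib.parse import urlencode

def dismax_phrase_params(user_query, query_fields, phrase_fields, phrase_slop):
    """Build a Dismax query string that adds a phrase boost to a user query."""
    params = [
        ("defType", "dismax"),
        ("q", user_query),
        ("qf", " ".join(query_fields)),   # fields searched term by term
        ("pf", " ".join(phrase_fields)),  # fields boosted when the whole query matches as a phrase
        ("ps", str(phrase_slop)),         # slop allowed for the pf phrase match
    ]
    return urlencode(params)

# Hypothetical fields and boosts; a large slop rewards proximity loosely.
query_string = dismax_phrase_params(
    "apache lucene eurocon", ["title", "body"], ["title^10", "body^2"], 100)
```

With a slop this large, documents containing all the query terms anywhere near each other still receive the phrase boost, which is the AND-like effect the slide describes.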

29 Resources
- ACM SIGIR
- Experts/Articles/Debugging-Relevance-Issues-Search
- Experts/Articles/Optimizing-Findability-Lucene-and-Solr
- Open Relevance Project:

30 Q&A
SLIDES POSTED AT: BIT.LY/EXPERTS1

31 Thank You
