Technical Deep Dive: Cassandra + Solr. Copyright 2012, Think Big Analytics, All Rights Reserved

Wilfred Reed
6 years ago

1 Technical Deep Dive: Cassandra + Solr. Confidential

2 Business case

3 Super scalable realtime analytics: Hadoop is fantastic at performing batch analytics. Cassandra is an advanced column-family-oriented system. Solr offers realtime analytics like a traditional RDBMS (except joins).

4 What is Solr?

5 Lucene: High-performance inverted index. Java based. Embeddable library...

6 Solr: Distributed search. Facets. Schemas. Dismax queries.

7 Terms: A posting list maps a term to a list of integer document IDs, e.g. dog = [0,3,6,7,9], cat = [1,2,3,5,9]. Terms are stored in sorted order.

8 Query Execution: The query is parsed into terms. Each term is looked up in the terms dictionary. For each term, the posting list is iterated and conjoined (AND) or disjoined (OR) with the other terms' posting lists.
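The query execution steps above can be sketched in a few lines of Python. This is only an illustration using the dog/cat posting lists from the Terms slide; the function names (`intersect`, `union`, `search`) are made up for this sketch and are not Lucene APIs, and real Lucene iterates compressed posting lists rather than Python lists.

```python
# Toy terms dictionary: term -> sorted list of integer document ids.
postings = {"dog": [0, 3, 6, 7, 9], "cat": [1, 2, 3, 5, 9]}


def intersect(a, b):
    """AND: walk two sorted posting lists in step, keeping ids found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: ids found in either posting list, returned in sorted order."""
    return sorted(set(a) | set(b))


def search(terms, op="AND"):
    """Look up each term, then conjoin (AND) or disjoin (OR) the posting lists."""
    lists = [postings.get(term, []) for term in terms]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other) if op == "AND" else union(result, other)
    return result


print(search(["dog", "cat"], "AND"))  # [3, 9]
print(search(["dog", "cat"], "OR"))   # [0, 1, 2, 3, 5, 6, 7, 9]
```

Keeping the document ids sorted is what lets the AND case run as a single linear merge.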

9 DataStax Enterprise (DSE)

10 DSE Cluster

11 DataStax Enterprise: Combines Cassandra with Solr. Best of both worlds. Distributed, Dynamo-based data distribution. Reliable, proven scalability. Lucene and Solr.

12 DSE Solr Features: Near realtime search. Multiple data centers. Reindex directly from Cassandra. Fast transaction log. Run MapReduce on Solr data. Realtime analytics.

13 DSE Solr Architecture: Extends the Cassandra secondary index API. Distributes queries using the ring topology over HTTP. Data is stored in Cassandra. The Lucene index is stored on each node directly on the OS filesystem (the index is not stored in Cassandra). Indexes are per column family only.

14 DSE Solr Architecture: Schema and configuration are stored in Cassandra. Updates can hit any server and are routed to the correct node(s) automatically. RandomPartitioner MD5-hashes documents / rows to the correct node(s).
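As a rough illustration of how an MD5 hash can route a row to its node(s), here is a simplified ring lookup in Python. The `Ring` class, the node names, the evenly spaced tokens, and the "walk clockwise for replicas" placement are all assumptions made for this sketch; Cassandra's actual token range and replication strategies differ in detail.

```python
import bisect
import hashlib


def md5_token(row_key):
    """Hash a row key to a non-negative integer token (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class Ring:
    """Hypothetical token ring: each node owns the range ending at its token."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Assumption: tokens are spread evenly over the 128-bit MD5 space.
        step = (2 ** 128) // len(self.nodes)
        self.tokens = [i * step for i in range(len(self.nodes))]

    def replicas(self, row_key, replication_factor=1):
        token = md5_token(row_key)
        # First node whose token is >= the row token, wrapping around the ring.
        start = bisect.bisect_left(self.tokens, token) % len(self.nodes)
        count = min(replication_factor, len(self.nodes))
        return [self.nodes[(start + k) % len(self.nodes)] for k in range(count)]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.replicas("doc-42", replication_factor=2))
```

Because the token depends only on the row key, any server that receives an update can compute the same owners and forward the write.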

15 Architecture: How Solr is integrated into Cassandra

16 DSE Solr Search: Queries are automatically distributed to online nodes in the cluster. When the replication factor is greater than 1, queries are load balanced.

17 DSE Solr Commit Log: The commit log is kept in sync with Solr. If a node crashes, no data is lost; the commit log is replayed on restart.

18 DSE Solr Data Model

19 DSE Best Practices

20 Production: Increase the replication factor for more queries per second. As with Cassandra, allocate enough RAM; the system IO cache determines queries per second and query latency.

21 Heap Space: Heap is used by the field caches (for sorting and facets) and by the terms dictionary index. The index itself is not loaded into heap; rely on the system IO cache.

22 Loading Configuration Files into DSE: DSE stores the configuration files in Cassandra. The same configuration files are used for each node. Use curl to HTTP POST the schema.xml and solrconfig.xml files into DSE.

23 Near Realtime Search: Use DSENRTCachingDirectoryFactory. Small segments are flushed to RAM. Once large enough, the small segments are flushed to disk. Set autoSoftCommit to 1-5 seconds. Reduce or eliminate the auto-warming in caches.

24 Validation Log: DSE Search stores Solr analyzing errors in the validation log: /var/log/cassandra/solrvalidation.log

25 DSENRTCachingDirectoryFactory: maxMergeSizeMB - the threshold (MB) for writing a merge segment to a RAMDirectory or to the file system. maxCachedMB - the maximum size (MB) of the RAMDirectory.

26 Using DSE: Comes with a Wikipedia demonstration application. Here is a quick example.

27 Query using CQL: Solr queries may be executed via CQL. Here is a quick example: SELECT title FROM solr WHERE solr_query='title:b*';

28 Resource URL: Configuration files are stored in Cassandra. Same configuration per column family. <keyspace>.<columnfamily>/<filename>.<ext>
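To make the resource path pattern concrete, the following Python sketch builds the URL for a configuration file and prepares (but does not send) an HTTP POST, mirroring what the curl upload would do. The base URL `http://localhost:8983/solr/resource`, the helper names, and the placeholder request body are assumptions for illustration; the keyspace `wiki` and column family `solr` are taken from the demo names used elsewhere in these slides.

```python
import urllib.request

# Assumption: host, port, and path prefix of the DSE Solr resource endpoint.
BASE_URL = "http://localhost:8983/solr/resource"


def resource_url(keyspace, column_family, filename, base_url=BASE_URL):
    """Build <base>/<keyspace>.<columnfamily>/<filename>.<ext>."""
    return f"{base_url}/{keyspace}.{column_family}/{filename}"


def build_upload_request(keyspace, column_family, filename, body):
    """Prepare an HTTP POST carrying a config file; nothing is sent here."""
    return urllib.request.Request(
        resource_url(keyspace, column_family, filename),
        data=body,
        method="POST",
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


# Placeholder body; a real upload would read schema.xml from disk.
request = build_upload_request("wiki", "solr", "schema.xml", b"<schema/>")
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would perform the upload against a running node.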

29 Solr Admin Console

30 Rebuilding an Index: Indexes can be rebuilt. Rebuilding is useful when the schema changes or the index has become corrupted. ./bin/dsetool rebuild_indexes wiki solr

31 Turn on Compression: Text can usually be compressed by a large factor. Turning on compression lets more data fit in the system IO cache. UPDATE COLUMN FAMILY solr WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

32 General Solr

33 Important Ideas: Queries. Documents and Fields. Analyzers. Segments. Schema.

34 Documents and Fields: Lucene indexes documents. Documents consist of fields. Fields consist of a name and one or more values.

35 Analyzers: Convert text into tokens / terms. Record the position of each token. Convert tokens as per design, such as stemming.
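The three analyzer responsibilities above can be illustrated with a toy Python analyzer. This is not how Lucene's analyzers are implemented; `naive_stem` is a deliberately crude suffix stripper invented for this sketch and stands in for a real stemming filter.

```python
import re


def naive_stem(token):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, lowercase, and stem; return (term, position) pairs."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [(naive_stem(token.lower()), position)
            for position, token in enumerate(tokens)]


print(analyze("Dogs barking at cats"))
# [('dog', 0), ('bark', 1), ('at', 2), ('cat', 3)]
```

The recorded positions are what later make phrase and proximity queries possible.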

36 Segments: Lucene stores the index in discrete units called segments. A merge policy determines how and when to merge segments (similar to compaction). At query time, segments are accessed.

37 Schema Structure: First, field types are defined, such as primitives, then text fields and their analyzers.

38 Schema Type Mapping: Solr field types are mapped to native Cassandra types (Solr Type -> Cassandra Type): TextField -> UTF8Type; LongField -> LongType; IntField -> Int32Type; StringField -> UTF8Type.

39 Query Overview: Solr queries offer many of the same features as SQL (except joins). Powerful, expressive, and fast.

40 Query Types: Search on any number of fields with boolean logic (AND, OR, +, -). Sort results per field, similar to SQL. Range queries. Phrase queries. Regular expression queries. Query boosting (DisMax).

41 Filter Queries: Cached bit sets. No score is calculated. Good for queries with many results that are reused, such as types or access controls.
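A minimal sketch of the cached bit set idea, assuming a tiny in-memory document list: each distinct filter is evaluated once, stored as a bit set (here a Python integer used as a bitmask), and reused on later queries without any scoring. The `FilterCache` class and the `type` field are hypothetical names for this illustration, not Solr classes.

```python
class FilterCache:
    """Cache one bit set per filter; bit i is set when document i matches."""

    def __init__(self, docs):
        self.docs = docs
        self.cache = {}
        self.misses = 0  # how many times a filter had to be computed

    def bitset(self, field, value):
        key = f"{field}:{value}"
        if key not in self.cache:
            self.misses += 1
            bits = 0
            for doc_id, doc in enumerate(self.docs):
                if doc.get(field) == value:
                    bits |= 1 << doc_id
            self.cache[key] = bits
        return self.cache[key]


def apply_filter(candidate_ids, bits):
    """Keep only candidate doc ids whose bit is set; no score is computed."""
    return [doc_id for doc_id in candidate_ids if (bits >> doc_id) & 1]


docs = [{"type": "book"}, {"type": "dvd"}, {"type": "book"}, {"type": "book"}]
cache = FilterCache(docs)
print(apply_filter([0, 1, 2], cache.bitset("type", "book")))  # [0, 2]
cache.bitset("type", "book")  # second use is served from the cache
print(cache.misses)  # 1
```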

42 Debug Queries: Pass in debug=true. Provides info about the timing of components. Debug info about the query. Debug info about the result scoring.

43 Sort By: Solr queries offer many of the same features as SQL (except joins). Powerful, expressive, and fast.

44 Range Queries: createdate: [ T23:59:59.999Z TO *]; field:[* TO 100]; -field:[* TO *] (finds all documents without a value for field).

45 Phrase Query: "data stax"~4 searches for "data" and "stax" within 4 words of each other.

46 Prefix Queries: myfield:foo* Queries cannot begin with an asterisk.

47 Regular Expressions: Use forward slashes to demarcate a regular expression query. Match on a five-digit zip code: body:/[0-9]{5}/

48 Spatial Queries: Bounding box. Distance. Filtering based on distance.

49 Auto Suggest: Uses SpellCheckComponent. The spellcheck / suggest index is built from an existing index. Can be set to automatically rebuild the suggest index on commit.

50 Prefix Auto Suggest: It is recommended to use FSTLookup or WFSTLookup; they are more memory efficient.

51 Auto Suggest Parameters: spellcheck = true; spellcheck.dictionary = suggest; spellcheck.onlyMorePopular = true; spellcheck.count = 5 (number of suggestions returned).

52 Auto Suggest by Popular Queries: Prefix-based auto-suggest can be limiting. Use EdgeNGramFilterFactory to query within terms. Sort results by a hit count field.
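To show why edge n-grams help here, the sketch below indexes every leading substring of each word in a stored query, so that typed text can match a word in the middle of a popular query, and then orders suggestions by hit count. It imitates the idea behind EdgeNGramFilterFactory in plain Python; the function names and the sample queries with their hit counts are invented for illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Leading substrings of a term, from min_gram up to max_gram characters."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_suggest_index(queries, min_gram=1, max_gram=10):
    """Map each edge n-gram of each word to the stored queries containing it."""
    index = {}
    for query in queries:
        for word in query.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index.setdefault(gram, set()).add(query)
    return index


def suggest(index, hit_counts, typed, count=5):
    """Return matching queries, most popular first (ties broken by name)."""
    matches = index.get(typed.lower(), set())
    return sorted(matches, key=lambda q: (-hit_counts[q], q))[:count]


# Hypothetical popular queries with their hit counts.
hit_counts = {"apache solr": 120, "cassandra solr": 80, "solar panels": 300}
index = build_suggest_index(hit_counts)

print(edge_ngrams("solr"))  # ['s', 'so', 'sol', 'solr']
print(suggest(index, hit_counts, "sol"))
# ['solar panels', 'apache solr', 'cassandra solr']
```

A pure prefix lookup on the whole query would have missed "apache solr" and "cassandra solr" for the input "sol".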

53 Dismax Query Parser: The Dismax query parser provides query-time, field-level boosting granularity with less special syntax. Dismax generally makes the best first-choice query parser for user-facing Solr applications.

54 Facets: Intersection count of another query. Commonly seen on shopping and other web sites. Solr supports multi-select faceting. Range faceting.
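The "intersection count" description can be made concrete with a small Python sketch: each facet value is treated as its own query with its own document id list, and its count is the size of the intersection with the main result set. The `facet_counts` function and the sample category data are illustrative assumptions, not Solr internals.

```python
def facet_counts(result_ids, facet_postings):
    """Count, per facet value, how many result documents also match that value."""
    result = set(result_ids)
    counts = {value: len(result & set(doc_ids))
              for value, doc_ids in facet_postings.items()}
    # Drop empty facets; order by count descending, then by value.
    non_empty = [(value, n) for value, n in counts.items() if n > 0]
    return sorted(non_empty, key=lambda pair: (-pair[1], pair[0]))


# Documents matching the main query (the "dog" posting list from the Terms slide).
result_ids = [0, 3, 6, 7, 9]
# Hypothetical facet field "category": value -> ids of documents with that value.
category_postings = {"book": [0, 1, 3], "dvd": [2, 6], "toy": [4, 5]}

print(facet_counts(result_ids, category_postings))  # [('book', 2), ('dvd', 1)]
```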

55 Facets Parameters: facet = true; facet.field = fields, comma separated; facet.query = query to facet on; facet.method = enum, fc, fcs (near realtime search).
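As a hedged example of how these parameters might be combined into a request, the Python sketch below assembles a URL-encoded query string. The helper name, the field names `category` and `price`, and the sample query are hypothetical; only the parameter names come from the slide, plus Solr's standard `q` parameter.

```python
from urllib.parse import parse_qs, urlencode


def facet_query_string(q, facet_field, facet_query=None, facet_method="fc"):
    """Assemble a URL-encoded query string that turns faceting on."""
    params = [
        ("q", q),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.method", facet_method),
    ]
    if facet_query is not None:
        params.append(("facet.query", facet_query))
    return urlencode(params)


query_string = facet_query_string(
    "title:solr", "category", facet_query="price:[* TO 100]"
)
print(query_string)
# Round-trip to check that the values survive URL encoding.
print(parse_qs(query_string)["facet.query"])  # ['price:[* TO 100]']
```

The resulting string would be appended to a select URL on a running node; the exact host and path depend on the deployment.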

56 Facet Example: facet = true; facet.field = fields, comma separated; facet.query = query to facet on; facet.method = enum, fc, fcs (near realtime search).

57 Group By: Much like SQL GROUP BY. Sort group values. Many options are available: sort documents within a group, scroll results per group. No aggregations.

58 Highlighting: Highlighting re-analyzes each document. The fast vector highlighter is faster but requires more storage.

59 Highlighting Parameters: hl = true; hl.fl = fields, comma separated; hl.useFastVectorHighlighter = true/false.

60 The End
