Technical Deep Dive: Cassandra + Solr
Confidential
Business case
Super scalable realtime analytics
- Hadoop is fantastic at performing batch analytics
- Cassandra is an advanced column family oriented system
- Solr offers realtime analytics like a traditional RDBMS (except joins)
What is Solr?
Lucene
- High performance inverted index
- Java based
- Embeddable library...
Solr
- Distributed search
- Facets
- Schemas
- Dismax queries
Terms Posting List
- Maps each term to a list of integer document ids
- dog = [0,3,6,7,9]
- cat = [1,2,3,5,9]
- Terms are stored in sorted order
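A minimal sketch of how a term-to-posting-list mapping could be built, assuming whitespace tokenization and document ids assigned in arrival order. The function name build_postings and the sample documents are hypothetical; Lucene's on-disk structures are far more compact than this.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each term to the sorted list of integer document ids containing it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated within one document is recorded once
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    # Keep the terms themselves in sorted order, like a terms dictionary
    return dict(sorted(postings.items()))

# Hypothetical sample documents; the list position is the document id
docs = ["dog barks", "cat naps", "cat purrs", "dog chases cat"]
index = build_postings(docs)
print(index["dog"])  # [0, 3]
print(index["cat"])  # [1, 2, 3]
```

Because document ids are assigned in increasing order, each posting list comes out sorted without an extra sort step.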
Query Execution
- Query is parsed into terms
- Each term is looked up in the terms dictionary
- For each term, the posting list is iterated and conjoined or disjoined with the other terms' posting lists
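The conjoin / disjoin step can be sketched with the dog and cat posting lists from the earlier slide. This is an illustration only; the function names are hypothetical and Lucene uses iterators with skipping rather than plain Python lists.

```python
def conjoin(a, b):
    """AND: ids present in both sorted posting lists, found by a linear merge."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def disjoin(a, b):
    """OR: ids present in either posting list, returned in sorted order."""
    return sorted(set(a) | set(b))

dog = [0, 3, 6, 7, 9]
cat = [1, 2, 3, 5, 9]
print(conjoin(dog, cat))  # [3, 9]
print(disjoin(dog, cat))  # [0, 1, 2, 3, 5, 6, 7, 9]
```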
Datastax Enterprise (DSE)
DSE Cluster
Datastax Enterprise
- Combines Cassandra with Solr
- Best of both worlds
- Distributed Dynamo based data distribution
- Reliable proven scalability
- Lucene and Solr 4.0
DSE Solr Features
- Near realtime search
- Multiple data centers
- Reindex directly from Cassandra
- Fast transaction log
- Run MapReduce on Solr data
- Realtime analytics
DSE Solr Architecture
- Extends Cassandra secondary index API
- Distributes queries using ring topology over HTTP
- Data stored in Cassandra
- Lucene index stored on each node directly on the OS filesystem (index is not stored in Cassandra)
- Index per column family only
DSE Solr Architecture
- Schema and configuration stored in Cassandra
- Updates can hit any server and are routed to the correct node(s) automatically
- RandomPartitioner MD5 hashes documents / rows to the correct node(s)
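A rough sketch of MD5-based routing, assuming the token is the absolute value of the MD5 digest read as a signed integer and that a row belongs to the first node whose token is greater than or equal to the row token, wrapping around the ring. The three-node ring, the node names, and the helper names are hypothetical, and replica placement is left out.

```python
import hashlib
from bisect import bisect_left

def md5_token(row_key: bytes) -> int:
    """Absolute value of the MD5 digest read as a signed big-endian integer."""
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

def owner(token: int, ring: list) -> str:
    """First node whose token is >= the row token, wrapping around the ring."""
    tokens = [t for t, _ in ring]
    idx = bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

# Hypothetical three-node ring with evenly spaced tokens in [0, 2**127)
ring = [(i * (2**127 // 3), f"node{i + 1}") for i in range(3)]

token = md5_token(b"wikipedia-article-42")  # hypothetical row key
print(owner(token, ring))
```

Since any node can compute the same token from the row key, an update arriving at any server can be forwarded to the owning node without a central lookup.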
Architecture
- How Solr is integrated into Cassandra
DSE Solr Search
- Queries are automatically distributed to online nodes in the cluster
- When replication factor > 1, queries are load balanced
DSE Solr Commit Log
- Commit log is kept in sync with Solr
- If a node crashes, no data is lost; the commit log is replayed on restart
DSE Solr Data Model
DSE Best Practices
Production
- Increase replication factor for more queries per second
- Like Cassandra, allocate enough RAM; the system IO cache determines queries per second and query latency
Heap Space
- Field caches used by sorting and facets
- Terms dictionary index
- The index is not loaded into heap
- Rely on the system IO cache
Loading Configuration Files into DSE
- DSE stores the configuration files in Cassandra
- Same configuration files used for each node
- Use curl to HTTP POST the schema.xml and solrconfig.xml files into DSE
Near Realtime Search
- Use DSENRTCachingDirectoryFactory
- Small segments flushed to RAM
- Once large enough, the small segments are flushed to disk
- Set autoSoftCommit to 1-5 seconds
- Reduce or eliminate the auto-warming in caches
Validation Log
- DSE Search stores Solr analyzing errors in the validation log
- /var/log/cassandra/solrvalidation.log
DSENRTCachingDirectoryFactory
- maxmergesizemb: the threshold (MB) for writing a merge segment to a RAMDirectory or to the file system
- maxcachemb: the maximum size (MB) of the RAMDirectory
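A sketch of how the two thresholds might interact, assuming the factory behaves like Lucene's NRTCachingDirectory: a segment is written to RAM only if it is below the merge-size threshold and still fits in the cache. The function name and the default values 4.0 and 48.0 are illustrative assumptions, not taken from the slides.

```python
MB = 1024 * 1024

def write_to_ram(segment_bytes, cached_bytes, maxmergesizemb=4.0, maxcachemb=48.0):
    """Decide whether a new segment goes to the RAMDirectory or to the file system.

    Assumed rule: the segment must be no larger than maxmergesizemb, and adding
    it must not push the RAMDirectory past maxcachemb.
    """
    small_enough = segment_bytes <= maxmergesizemb * MB
    fits_in_cache = cached_bytes + segment_bytes <= maxcachemb * MB
    return small_enough and fits_in_cache

print(write_to_ram(1 * MB, 0))        # True: small segment, empty cache
print(write_to_ram(5 * MB, 0))        # False: above the merge-size threshold
print(write_to_ram(2 * MB, 47 * MB))  # False: cache would overflow
```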
Using DSE
- Comes with Wikipedia demonstration application
- Here is a quick example
Query using CQL
- Solr queries may be executed via CQL
- Here is a quick example
    SELECT title FROM solr WHERE solr_query='title:b*';
Resource URL
- Configuration files are stored in Cassandra
- Same configuration per column family
- http://<host>:<port>/solr/resource/<keyspace>.<columnfamily>/<filename>.<ext>
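The deck uploads configuration files with curl; the same HTTP POST can be sketched in Python. The host, port, keyspace, and column family below are taken from the admin console URL used elsewhere in the deck, while the helper names, the Content-Type header, and the placeholder body are assumptions. The request is only constructed here, not sent.

```python
from urllib.request import Request

def resource_url(host, port, keyspace, column_family, filename):
    """Fill in the resource URL pattern for one configuration file."""
    return f"http://{host}:{port}/solr/resource/{keyspace}.{column_family}/{filename}"

def upload_request(url, body: bytes):
    """Build an HTTP POST carrying the file contents (header value is an assumption)."""
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )

url = resource_url("localhost", 8983, "wiki", "solr", "schema.xml")
req = upload_request(url, b"<schema/>")  # placeholder body, not a real schema
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running DSE node
```

The same call with "solrconfig.xml" as the filename would cover the second configuration file.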
Solr Admin Console
- http://localhost:8983/solr/wiki.solr/admin/
Rebuilding an Index
- Indexes can be rebuilt
- Rebuilding is useful when the schema changes or the index has become corrupted
    ./bin/dsetool rebuild_indexes wiki solr
Turn on Compression
- Text can usually be compressed by a large factor
- Turning on compression lets more data fit in the system IO cache
    UPDATE COLUMN FAMILY solr WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};
General Solr
Important Ideas
- Queries
- Documents and Fields
- Analyzers
- Segments
- Schema
Documents and Fields
- Lucene indexes documents
- Documents consist of fields
- Fields consist of a name and one or more values
Analyzers
- Convert text into tokens / terms
- Record the position of each token
- Transform tokens as configured, for example by stemming
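A toy analyzer chain, sketched under the assumption of a simple alphanumeric tokenizer, lowercasing, and a deliberately crude suffix stripper standing in for stemming. The function name analyze is hypothetical, and real Solr analyzers are configured per field type in the schema.

```python
import re

def analyze(text):
    """Return (position, term) pairs: tokenize, lowercase, then crude suffix stripping."""
    terms = []
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        term = token.lower()
        # Toy stand-in for stemming: drop a trailing "s" from longer words
        if len(term) > 3 and term.endswith("s"):
            term = term[:-1]
        terms.append((position, term))
    return terms

print(analyze("Dogs chase Cats"))  # [(0, 'dog'), (1, 'chase'), (2, 'cat')]
```

The recorded positions are what later make phrase and proximity queries possible.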
Segments
- Lucene stores the index in discrete units called segments
- A merge policy determines how and when segments are merged (similar to compaction)
- At query time, the segments are searched
Schema Structure
- Field types are defined first: primitives, then text fields and their analyzers
Schema Type Mapping
- Solr field types are mapped to native Cassandra types

Solr Type     Cassandra Type
TextField     UTF8Type
LongField     LongType
IntField      Int32Type
StringField   UTF8Type
Query Overview
- Solr queries offer many of the same features as SQL (except joins)
- Powerful, expressive, and fast
Query Types
- Search on any number of fields with boolean logic (AND, OR, +, -)
- Sort results per field, similar to SQL
- Range queries
- Phrase queries
- Regular expression queries
- Query boosting (DisMax)
Filter Queries
- Cached bit sets
- No score calculated
- Good for queries with many results that are reused, such as types or access controls
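The idea of a cached, unscored filter can be sketched with Python integers used as bit sets. The cache, the helper names, and the sample filter string are hypothetical; this only illustrates why a reused filter is cheap, not how Solr implements its filter cache.

```python
filter_cache = {}

def bitset(doc_ids):
    """Pack document ids into an integer, one bit per document."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def cached_filter(fq, compute_doc_ids):
    """Compute the bit set for a filter query once, then reuse it."""
    if fq not in filter_cache:
        filter_cache[fq] = bitset(compute_doc_ids())
    return filter_cache[fq]

def apply_filter(scored_hits, filter_bits):
    """Keep hits whose document bit is set; the scores are left untouched."""
    return [(doc, score) for doc, score in scored_hits if (filter_bits >> doc) & 1]

hits = [(0, 1.2), (3, 0.9), (9, 0.4)]  # (doc id, score) from the main query
articles = cached_filter("type:article", lambda: [3, 5, 9])
print(apply_filter(hits, articles))  # [(3, 0.9), (9, 0.4)]
```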
Debug Queries
- Pass in debug=true
- Provides timing information for each component
- Debug info about the query
- Debug info about the result scoring
Sort By
Range Queries
- createdate:[1999-01-01T23:59:59.999Z TO *]
- field:[* TO 100]
- -field:[* TO *] finds all documents without a value for field
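A small helper for composing these range clauses and placing them in a request URL. The helper names are hypothetical, and the /select path on the wiki.solr core is an assumption based on the admin console URL shown earlier in the deck.

```python
from urllib.parse import urlencode

def range_clause(field, low="*", high="*"):
    """Build an inclusive range clause; * leaves that side open."""
    return f"{field}:[{low} TO {high}]"

def select_url(base, **params):
    """Append URL-encoded query parameters to an assumed /select handler."""
    return f"{base}/select?{urlencode(params)}"

since_1999 = range_clause("createdate", "1999-01-01T23:59:59.999Z")
up_to_100 = range_clause("field", high=100)
no_value = "-" + range_clause("field")

print(since_1999)  # createdate:[1999-01-01T23:59:59.999Z TO *]
print(up_to_100)   # field:[* TO 100]
print(no_value)    # -field:[* TO *]
print(select_url("http://localhost:8983/solr/wiki.solr", q=since_1999))
```

Encoding the clause with urlencode matters because brackets, colons, and spaces are not safe to paste into a URL as-is.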
Phrase Query
- "data stax"~4
- Search for "data" and "stax" within 4 words of each other
Prefix Queries
- myfield:foo*
- Queries cannot begin with an asterisk
Regular Expressions
- Use forward slashes to demarcate a regular expression query
- Match on a five-digit zip code: body:/[0-9]{5}/
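An illustration of what a regular expression query does against the terms of a field, assuming the pattern must match a whole indexed term. Python's re module is used as a stand-in; Lucene's regular expression dialect is not identical, and the sample postings are hypothetical.

```python
import re

zip_code = re.compile(r"[0-9]{5}")

def matching_docs(postings, pattern):
    """Union of the posting lists of every term the pattern matches in full."""
    docs = set()
    for term, doc_ids in postings.items():
        if pattern.fullmatch(term):
            docs.update(doc_ids)
    return sorted(docs)

# Hypothetical terms indexed for the body field
postings = {"94105": [0, 2], "941": [1], "street": [0, 1], "10001": [3]}
print(matching_docs(postings, zip_code))  # [0, 2, 3]
```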
Spatial Queries
- Bounding box
- Distance
- Filtering based on distance
Auto Suggest
- Uses SpellCheckComponent
- Spellcheck / suggest is built from an existing index
- Can be set to automatically rebuild the suggest index on commit
Prefix Auto Suggest
- It is recommended to use FSTLookup or WFSTLookup
- They are more memory efficient
Auto Suggest Parameters

Parameter                    Value
spellcheck                   true
spellcheck.dictionary        suggest
spellcheck.onlyMorePopular   true
spellcheck.count             5 (number of suggestions returned)
Auto Suggest by Popular Queries
- Prefix based auto-suggest can be limiting
- Use EdgeNGramFilterFactory to query within terms
- Sort results by a hit count field
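A sketch of the edge n-gram approach: each word of a stored query is expanded into its leading substrings, so typed input can match any word rather than only the start of the whole string, and matches are ordered by hit count. The gram size defaults, the function names, and the sample hit counts are assumptions for illustration.

```python
def edge_ngrams(term, min_gram=1, max_gram=15):
    """Leading substrings of a term, from min_gram up to max_gram characters."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]

def suggest(typed, popular_queries, limit=5):
    """Stored queries with any word starting with the typed text, most hits first."""
    matches = [
        (query, hits)
        for query, hits in popular_queries.items()
        if any(typed in edge_ngrams(word) for word in query.split())
    ]
    matches.sort(key=lambda match: match[1], reverse=True)
    return [query for query, _ in matches[:limit]]

# Hypothetical popular queries with their hit counts
popular = {"cassandra solr": 40, "solr facets": 75, "hadoop batch": 10}
print(edge_ngrams("solr", 2))  # ['so', 'sol', 'solr']
print(suggest("so", popular))  # ['solr facets', 'cassandra solr']
```

A plain prefix suggester would miss "cassandra solr" for the input "so", which is the limitation the slide refers to.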
Dismax Query Parser
- Provides field-level boosting granularity at query time, with less special syntax
- Generally the best first choice of query parser for user-facing Solr applications
Facets
- Count of the documents in the result set that also match another query (an intersection count)
- Commonly seen on shopping and other web sites
- Solr supports multi-select faceting
- Range faceting
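The intersection count behind a facet can be sketched with plain sets: for each facet value, count the result documents that also appear in that value's document set. The function name and the sample categories are hypothetical.

```python
def facet_counts(result_docs, facet_doc_sets):
    """For each facet value, count result documents that also match that value."""
    result = set(result_docs)
    return {value: len(result & set(docs)) for value, docs in facet_doc_sets.items()}

results = [0, 3, 6, 7, 9]  # documents matching the main query
by_category = {"books": [0, 1, 3], "music": [2, 6], "video": [4, 5]}
print(facet_counts(results, by_category))  # {'books': 2, 'music': 1, 'video': 0}
```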
Facets Parameters

Parameter      Value
facet          true
facet.field    fields, comma separated
facet.query    query to facet on
facet.method   enum, fc, fcs (near realtime search)
Facet Example
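One way a facet request might be assembled from the parameters listed on the previous slide. The /select path, the title and category field names, and the chosen facet.method are assumptions for illustration; the URL is only built here, not fetched.

```python
from urllib.parse import urlencode, parse_qs

params = [
    ("q", "title:solr"),
    ("facet", "true"),
    ("facet.field", "category"),  # hypothetical field name
    ("facet.query", "createdate:[1999-01-01T23:59:59.999Z TO *]"),
    ("facet.method", "enum"),
]
url = "http://localhost:8983/solr/wiki.solr/select?" + urlencode(params)
print(url)

# Decoding the query string again shows the parameters survive URL encoding
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["facet.query"])
```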
Group By
- Much like SQL group by
- Sort group values
- Many options available: sort documents in a group, scroll results per group
- No aggregations
Highlighting
- Highlighting re-analyzes each document
- Fast vector highlighter is faster but requires more storage
Highlighting Parameters

Parameter                     Value
hl                            true
hl.fl                         fields, comma separated
hl.useFastVectorHighlighter   true/false
The End
jason.rutherglen@thinkbiganalytics.com