Technical Deep Dive: Cassandra + Solr. Copyright 2012, Think Big Analytics, All Rights Reserved

Technical Deep Dive: Cassandra + Solr (Confidential)

Business case

Super scalable realtime analytics
- Hadoop is fantastic at performing batch analytics
- Cassandra is an advanced column-family-oriented system
- Solr offers realtime analytics like a traditional RDBMS (except joins)

What is Solr?

Lucene
- High-performance inverted index
- Java based
- Embeddable library...

Solr
- Distributed search
- Facets
- Schemas
- Dismax queries

Terms
- Posting list: maps a term to a list of integer document ids
- dog = [0,3,6,7,9]
- cat = [1,2,3,5,9]
- Terms are stored in sorted order
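To make the posting list idea concrete, here is a minimal Python sketch that builds a term-to-document-id mapping from a handful of toy documents. It is an illustration only, not how Lucene stores its index; the function name `build_index` and the sample documents are assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of integer document ids (hypothetical helper)."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated inside one document is recorded once
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    # Document ids were appended in increasing order, so each list is already sorted.
    # Sorting the keys mirrors the sorted terms dictionary.
    return {term: postings[term] for term in sorted(postings)}

# Toy documents, invented for illustration
docs = ["dog", "cat", "cat", "dog cat", "bird"]
index = build_index(docs)
```

Here `index["dog"]` is `[0, 3]` and the keys come out in sorted order: `bird`, `cat`, `dog`.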

Query Execution
- The query is parsed into terms
- Each term is looked up in the terms dictionary
- For each term, the posting list is iterated and conjoined or disjoined with the other terms' posting lists
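The conjoin and disjoin steps can be sketched in Python using the dog and cat posting lists from the Terms slide. This is a simplified illustration under the assumption that posting lists are plain sorted lists of ids; the helper names `intersect` and `union` are invented for the example, and Lucene's real iterators are more elaborate.

```python
def intersect(a, b):
    """Conjunction (AND): ids present in both sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return out

def union(a, b):
    """Disjunction (OR): ids present in either posting list, kept sorted."""
    return sorted(set(a) | set(b))

dog = [0, 3, 6, 7, 9]
cat = [1, 2, 3, 5, 9]
both = intersect(dog, cat)   # dog AND cat
either = union(dog, cat)     # dog OR cat
```

Because both lists are sorted, the intersection walks each list once, which is why sorted integer ids suit this structure.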

Datastax Enterprise (DSE)

DSE Cluster

Datastax Enterprise
- Combines Cassandra with Solr
- Best of both worlds
- Distributed Dynamo-based data distribution
- Reliable proven scalability
- Lucene and Solr 4.0

DSE Solr Features
- Near realtime search
- Multiple data centers
- Reindex directly from Cassandra
- Fast transaction log
- Run MapReduce on Solr data
- Realtime analytics

DSE Solr Architecture
- Extends the Cassandra secondary index API
- Distributes queries using the ring topology over HTTP
- Data is stored in Cassandra
- The Lucene index is stored on each node directly on the OS filesystem (the index is not stored in Cassandra)
- One index per column family only

DSE Solr Architecture
- Schema and configuration are stored in Cassandra
- Updates can hit any server and are routed to the correct node(s) automatically
- RandomPartitioner MD5-hashes documents / rows to the correct node(s)
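As a rough illustration of MD5-based routing, the sketch below hashes a row key to a token and looks up the owning node on a sorted ring. It is a simplification under stated assumptions: the token here is the unsigned 128-bit MD5 value, the four-node ring and the names `md5_token` and `owner` are invented for the example, replication is ignored, and Cassandra's actual token arithmetic differs in detail.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 128  # assumption: treat the MD5 digest as an unsigned 128-bit token

def md5_token(row_key: bytes) -> int:
    """Hash a row key to an integer token (simplified)."""
    return int.from_bytes(hashlib.md5(row_key).digest(), "big")

def owner(token, ring):
    """ring is a sorted list of (token, node) pairs.

    The node with the first token >= the key's token owns the key,
    wrapping around to the first node past the end of the ring.
    """
    tokens = [t for t, _ in ring]
    idx = bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

# Hypothetical four-node ring with evenly spaced tokens
ring = [(i * RING_SIZE // 4, f"node{i}") for i in range(4)]
node = owner(md5_token(b"some-row-key"), ring)
```

Any server can compute the same token for a given key, which is what allows an update to arrive at any node and still be forwarded to the right place.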

Architecture
- How Solr is integrated into Cassandra

DSE Solr Search
- Queries are automatically distributed to online nodes in the cluster
- When the replication factor is > 1, queries are load balanced

DSE Solr Commit Log
- The commit log is kept in sync with Solr
- If a node crashes, no data is lost; the commit log is replayed on restart

DSE Solr Data Model

DSE Best Practices

Production
- Increase the replication factor for more queries per second
- As with Cassandra, allocate enough RAM; the system IO cache determines queries per second and query latency

Heap Space
- Field caches used by sorting and facets
- Terms dictionary index
- The index is not loaded into heap
- Rely on the system IO cache

Loading Configuration Files into DSE
- DSE stores the configuration files in Cassandra
- The same configuration files are used for each node
- Use curl to HTTP POST the schema.xml and solrconfig.xml files into DSE

Near Realtime Search
- Use DSENRTCachingDirectoryFactory
- Small segments are flushed to RAM
- Once large enough, the small segments are flushed to disk
- Set autoSoftCommit to 1-5 seconds
- Reduce or eliminate auto-warming in the caches

Validation Log
- DSE Search stores Solr analyzing errors in the validation log
- /var/log/cassandra/solrvalidation.log

DSENRTCachingDirectoryFactory
- maxmergesizemb: the threshold (MB) for writing a merge segment to a RAMDirectory or to the file system
- maxcachemb: the maximum value (MB) of the RAMDirectory

Using DSE
- Comes with a Wikipedia demonstration application
- Here is a quick example

Query using CQL
- Solr queries may be executed via CQL
- Here is a quick example:
  SELECT title FROM solr WHERE solr_query='title:b*';

Resource URL
- Configuration files are stored in Cassandra
- Same configuration per column family
- http://<host>:<port>/solr/resource/<keyspace>.<columnfamily>/<filename>.<ext>
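The resource URL pattern can be combined with the HTTP POST upload mentioned earlier. The Python sketch below only constructs the request and does not send it. The host, port, keyspace and column family values are taken from the demo URLs in this deck; the helper names and the `Content-Type` header are assumptions for illustration, not confirmed DSE requirements.

```python
from urllib.request import Request

def resource_url(host, port, keyspace, column_family, filename):
    """Fill in the resource URL pattern from the slide."""
    return f"http://{host}:{port}/solr/resource/{keyspace}.{column_family}/{filename}"

def build_upload(host, port, keyspace, column_family, filename, body: bytes):
    """Build (but do not send) an HTTP POST carrying a configuration file."""
    return Request(
        resource_url(host, port, keyspace, column_family, filename),
        data=body,
        method="POST",
        # Assumption: the header value is a guess, adjust to what the server expects
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )

# Placeholder body; in practice this would be the bytes of schema.xml
req = build_upload("localhost", 8983, "wiki", "solr", "schema.xml", b"<schema/>")
```

Passing `req` to `urllib.request.urlopen` would perform the upload against a running node; the same call with `solrconfig.xml` covers the second file.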

Solr Admin Console
- http://localhost:8983/solr/wiki.solr/admin/

Rebuilding an Index
- Indexes can be rebuilt
- Rebuilding is useful when the schema changes or the index has become corrupted
- ./bin/dsetool rebuild_indexes wiki solr

Turn on Compression
- Text can usually be compressed by a large factor
- Turning on compression lets more data fit in the system IO cache
- UPDATE COLUMN FAMILY solr WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

General Solr

Important Ideas
- Queries
- Documents and Fields
- Analyzers
- Segments
- Schema

Documents and Fields
- Lucene indexes documents
- Documents consist of fields
- Fields consist of a name and one or more values

Analyzers
- Convert text into tokens / terms
- Record the position of each token
- Convert tokens as designed, for example by stemming
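A toy analyzer in Python shows the three steps together: tokenizing, recording positions, and transforming tokens. The function `analyze` is hypothetical, and the suffix-stripping rule is a crude stand-in for a real stemmer, not the algorithm any Solr analyzer uses.

```python
import re

def analyze(text):
    """Return (term, position) pairs for a piece of text (illustrative only)."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        term = match.group().lower()
        # Crude plural stripping as a stand-in for stemming
        if term.endswith("s") and len(term) > 3:
            term = term[:-1]
        tokens.append((term, position))
    return tokens

tokens = analyze("Dogs chase cats")
```

The result is `[("dog", 0), ("chase", 1), ("cat", 2)]`. The recorded positions are what make phrase and proximity queries possible later.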

Segments
- Lucene stores the index in discrete units called segments
- A merge policy determines how and when segments are merged (similar to compaction)
- At query time, the segments are searched

Schema Structure
- Field types are defined first: primitives, then text fields and their analyzers

Schema Type Mapping
- Solr field types are mapped to native Cassandra types
  Solr Type     Cassandra Type
  TextField     UTF8Type
  LongField     LongType
  IntField      Int32Type
  StringField   UTF8Type

Query Overview
- Solr queries offer many of the same features as SQL (except joins)
- Powerful, expressive, and fast

Query Types
- Search on any number of fields with boolean logic (AND, OR, +, -)
- Sort results per field, similar to SQL
- Range queries
- Phrase queries
- Regular expression queries
- Query boosting (DisMax)

Filter Queries
- Cached bit sets
- No score calculated
- Good for queries with many results that are reused, such as types or access controls
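The cached bit set idea can be sketched in Python by using an integer as a bit set, one bit per document id. This is an illustration of the caching pattern only; the class `FilterCache`, the helper `apply_filter` and the sample documents are assumptions, and Solr's filter cache is implemented differently.

```python
class FilterCache:
    """Cache one bit set per filter so repeated filters skip re-evaluation."""

    def __init__(self):
        self._cache = {}
        self.misses = 0

    def bits(self, name, predicate, docs):
        if name not in self._cache:
            self.misses += 1
            mask = 0
            for doc_id, doc in enumerate(docs):
                if predicate(doc):
                    mask |= 1 << doc_id  # set the bit for a matching document
            self._cache[name] = mask
        return self._cache[name]

def apply_filter(candidates, mask):
    """Keep candidate ids whose bit is set; no score is computed."""
    return [d for d in candidates if (mask >> d) & 1]

# Hypothetical documents with a reusable "type" restriction
docs = [{"type": "book"}, {"type": "dvd"}, {"type": "book"}]
cache = FilterCache()
mask = cache.bits("type:book", lambda d: d["type"] == "book", docs)
kept = apply_filter([0, 1, 2], mask)
mask_again = cache.bits("type:book", lambda d: d["type"] == "book", docs)
```

The second lookup of `type:book` is served from the cache, which is the benefit for restrictions that many queries share.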

Debug Queries
- Pass in debug=true
- Provides info about the timing of components
- Debug info about the query
- Debug info about the result scoring

Sort By

Range Queries
- createdate:[1999-01-01T23:59:59.999Z TO *]
- field:[* TO 100]
- -field:[* TO *] finds all documents without a value for field
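Range queries contain brackets, colons and spaces, so they need URL encoding when sent over HTTP. The Python sketch below builds a query URL from the range examples on this slide plus the debug flag. The `/select` path under the demo core URL and the helper name `select_url` are assumptions for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(base, **params):
    """Append URL-encoded query parameters to an assumed /select handler path."""
    return f"{base}/select?{urlencode(params)}"

url = select_url(
    "http://localhost:8983/solr/wiki.solr",       # base taken from the admin console URL
    q="createdate:[1999-01-01T23:59:59.999Z TO *]",
    fq="-field:[* TO *]",                          # documents without a value for field
    debug="true",
)

# Decoding the query string recovers the original parameters
decoded = parse_qs(urlsplit(url).query)
```

Letting `urlencode` handle the escaping avoids hand-written percent codes, which are easy to get wrong for the `[`, `]` and `*` characters.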

Phrase Query
- "data stax"~4
- Searches for "data" and "stax" within 4 words of each other

Prefix Queries
- myfield:foo*
- Queries cannot begin with an asterisk

Regular Expressions
- Use forward slashes to demarcate a regular expression query
- Match on a five-digit zip code: body:/[0-9]{5}/
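The zip code pattern can be tried out in Python. Lucene regular expression queries match against whole indexed terms, so the sketch uses `fullmatch` on individual tokens rather than searching inside a string. Lucene's regex syntax is not identical to Python's, but this particular pattern is valid in both; the helper `matching_terms` and the sample terms are invented for the example.

```python
import re

ZIP = re.compile(r"[0-9]{5}")

def matching_terms(terms):
    """Return the terms that match the pattern in full."""
    return [t for t in terms if ZIP.fullmatch(t)]

# Sample terms as an analyzer might have produced them
hits = matching_terms(["94105", "1234", "123456", "ab123"])
```

Only `94105` matches: the four-digit and six-digit terms fail because the whole term must fit the pattern.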

Spatial Queries
- Bounding box
- Distance
- Filtering based on distance

Auto Suggest
- Uses SpellCheckComponent
- Spellcheck / suggest is built from an existing index
- Can be set to automatically rebuild the suggest index on commit

Prefix Auto Suggest
- It is recommended to use FSTLookup or WFSTLookup
- They are more memory efficient

Auto Suggest Parameters
  spellcheck                   true
  spellcheck.dictionary        suggest
  spellcheck.onlyMorePopular   true
  spellcheck.count             5 (number of suggestions returned)

Auto Suggest by Popular Queries
- Prefix-based auto-suggest can be limiting
- Use EdgeNGramFilterFactory to query within terms
- Sort results by a hit count field
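A small Python sketch illustrates what edge n-grams are and how they combine with a hit count for ranking suggestions. It imitates the idea behind EdgeNGramFilterFactory and is not its implementation; the function names, the stored queries and their hit counts are all made up for the example.

```python
def edge_ngrams(term, min_gram=1, max_gram=5):
    """Prefixes of a term from min_gram up to max_gram characters."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

def suggest(prefix, queries):
    """queries maps a past query to its hit count (hypothetical data).

    A query matches when the prefix is an edge n-gram of any of its words,
    so a word in the middle of the query can match too.
    """
    matches = [
        q for q in queries
        if any(prefix in edge_ngrams(word, 1, len(word)) for word in q.split())
    ]
    # Most popular first
    return sorted(matches, key=lambda q: -queries[q])

grams = edge_ngrams("solr", 2, 10)
popular = {"apache solr": 40, "solar panel": 75, "cassandra": 90}
suggestions = suggest("sol", popular)
```

`grams` is `["so", "sol", "solr"]`, and `suggestions` is `["solar panel", "apache solr"]`: "apache solr" matches on its second word, which a plain prefix match on the whole query would miss.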

Dismax Query Parser
- The Dismax query parser provides query-time, field-level boosting granularity with less special syntax
- Dismax is generally the best first-choice query parser for user-facing Solr applications

Facets
- Intersection count of another query
- Commonly seen on shopping and other web sites
- Solr supports multi-select faceting
- Range faceting
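The "intersection count" description can be made concrete with a few lines of Python: each facet value has its own set of matching documents, and the facet count is the size of its overlap with the main query's results. The function `facet_counts` and the category data are invented for illustration.

```python
def facet_counts(result_ids, facet_postings):
    """Count, per facet value, how many result documents also carry that value."""
    result = set(result_ids)
    return {value: len(result & set(ids)) for value, ids in facet_postings.items()}

# Hypothetical main query results and per-category document ids
results = [0, 3, 6, 7, 9]
categories = {"books": [0, 1, 3], "dvds": [2, 6, 9], "games": [4, 5]}
counts = facet_counts(results, categories)
```

`counts` is `{"books": 2, "dvds": 2, "games": 0}`, the kind of per-category tally shown next to search results on shopping sites.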

Facets Parameters
  facet          true
  facet.field    fields, comma separated
  facet.query    query to facet on
  facet.method   enum, fc, fcs (near realtime search)

Facet Example
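As one possible example, the Python sketch below assembles a faceted request from the parameters listed on the previous slide. The field names `category` and `author`, the price range and the `/select` path are assumptions for illustration. The sketch repeats the `facet.field` key once per field, which is one way to pass several fields.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# A list of pairs (rather than a dict) lets a key such as facet.field repeat
params = [
    ("q", "*:*"),                       # match all documents
    ("facet", "true"),
    ("facet.field", "category"),        # hypothetical field
    ("facet.field", "author"),          # hypothetical field
    ("facet.query", "price:[* TO 100]"),
]
facet_url = "http://localhost:8983/solr/wiki.solr/select?" + urlencode(params)

# Decoding shows both facet.field values survive the round trip
decoded_facets = parse_qs(urlsplit(facet_url).query)
```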

Group By
- Much like SQL GROUP BY
- Sort group values
- Many options available: sort documents in a group, scroll results per group
- No aggregations

Highlighting
- Highlighting re-analyzes each document
- The fast vector highlighter is faster but requires more storage

Highlighting Parameters
  hl                            true
  hl.fl                         fields, comma separated
  hl.useFastVectorHighlighter   true/false

The End
jason.rutherglen@thinkbiganalytics.com