Technical Deep Dive: Cassandra + Solr. Copyright 2012, Think Big Analytics, All Rights Reserved

Technical Deep Dive: Cassandra + Solr. Confidential

Business case

Super scalable realtime analytics
Hadoop is fantastic at performing batch analytics
Cassandra is an advanced column-family-oriented system
Solr offers realtime analytics like a traditional RDBMS (except joins)

What is Solr?

Lucene
High-performance inverted index:
Java based
Embeddable library...

Solr
Distributed search
Facets
Schemas
Dismax queries

Terms
Posting list: maps a term to a list of integer document ids.
dog = [0,3,6,7,9]
cat = [1,2,3,5,9]
Terms are stored in sorted order.

Query Execution
The query is parsed into terms
Each term is looked up in the terms dictionary
For each term, the posting list is iterated and conjoined or disjoined with the other terms' posting lists
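The lookup-and-combine step can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's implementation: the `index` dict stands in for the terms dictionary, and the `intersect`, `union`, and `search` names are hypothetical helpers. The posting lists are the ones from the Terms slide.

```python
def intersect(a, b):
    """AND: walk two sorted posting lists in step, keeping shared ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """OR: merge two posting lists into one sorted list without duplicates."""
    return sorted(set(a) | set(b))


# Stand-in terms dictionary: term -> sorted list of document ids.
index = {"dog": [0, 3, 6, 7, 9], "cat": [1, 2, 3, 5, 9]}


def search(terms, op="AND"):
    """Look up each term, then conjoin (AND) or disjoin (OR) the posting lists."""
    lists = [index.get(term, []) for term in terms]
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other) if op == "AND" else union(result, other)
    return result


print(search(["dog", "cat"]))        # documents containing both terms
print(search(["dog", "cat"], "OR"))  # documents containing either term
```

Keeping posting lists sorted is what lets the AND case run in a single pass over both lists.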

DataStax Enterprise (DSE)

DSE Cluster

DataStax Enterprise
Combines Cassandra with Solr
Best of both worlds
Distributed, Dynamo-based data distribution
Reliable, proven scalability
Lucene and Solr 4.0

DSE Solr Features
Near realtime search
Multiple data centers
Reindex directly from Cassandra
Fast transaction log
Run MapReduce on Solr data
Realtime analytics

DSE Solr Architecture
Extends the Cassandra secondary index API
Distributes queries using the ring topology over HTTP
Data is stored in Cassandra
The Lucene index is stored on each node directly on the OS filesystem (the index is not stored in Cassandra)
One index per column family only

DSE Solr Architecture
Schema and configuration are stored in Cassandra
Updates can hit any server and are routed to the correct node(s) automatically
RandomPartitioner uses MD5 to hash documents / rows to the correct node(s)
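A rough Python sketch of MD5-based routing follows. It is a simplification under stated assumptions, not DSE's code: the three-node `ring`, the node names, and the helpers `md5_token` and `replicas_for` are all hypothetical. Replicas are placed on the next nodes clockwise, which ignores data-center-aware placement.

```python
import bisect
import hashlib


def md5_token(row_key: bytes) -> int:
    """Hash a row key to a non-negative integer token using MD5."""
    digest = hashlib.md5(row_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def replicas_for(row_key: bytes, ring, replication_factor=1):
    """Pick the first node whose token is >= the key's token, then its successors.

    `ring` is a list of (token, node_name) pairs sorted by token.
    """
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, md5_token(row_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + k) % len(ring)][1] for k in range(count)]


# Hypothetical three-node ring with evenly spaced tokens in [0, 2**127).
ring = [(i * (2**127 // 3), f"node{i}") for i in range(3)]

print(replicas_for(b"row-1", ring, replication_factor=2))
```

Because the token depends only on the row key, any server receiving an update can compute the same owner list and forward the write.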

Architecture
How Solr is integrated into Cassandra

DSE Solr Search
Queries are automatically distributed to online nodes in the cluster
When the replication factor is > 1, queries are load balanced

DSE Solr Commit Log
The commit log is kept in sync with Solr
If a node crashes, no data is lost: the commit log is replayed on restart

DSE Solr Data Model

DSE Best Practices

Production
Increase the replication factor for more queries per second
As with Cassandra, allocate enough RAM: the system IO cache determines queries per second and query latency

Heap Space
Field caches, used by sorting and facets
Terms dictionary index
The index is not loaded into heap
Rely on the system IO cache

Loading Configuration Files into DSE
DSE stores the configuration files in Cassandra
The same configuration files are used for each node
Use curl to HTTP POST the schema.xml and solrconfig.xml files into DSE

Near Realtime Search
Use DSENRTCachingDirectoryFactory
Small segments are flushed to RAM
Once large enough, the small segments are flushed to disk
Set autoSoftCommit to 1-5 seconds
Reduce or eliminate auto-warming in caches

Validation Log
DSE Search stores Solr analyzing errors in the validation log
/var/log/cassandra/solrvalidation.log

DSENRTCachingDirectoryFactory
maxMergeSizeMB - The threshold (MB) for writing a merge segment to a RAMDirectory or to the file system
maxCachedMB - The maximum size (MB) of the RAMDirectory

Using DSE
Comes with a Wikipedia demonstration application
Here is a quick example

Query using CQL
Solr queries may be executed via CQL
Here is a quick example
SELECT title FROM solr WHERE solr_query='title:b*';

Resource URL
Configuration files are stored in Cassandra
Same configuration per column family
http://<host>:<port>/solr/resource/<keyspace>.<columnfamily>/<filename>.<ext>
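As a sketch of how the resource URL pattern combines with the HTTP POST upload mentioned on the "Loading Configuration Files into DSE" slide, the Python below builds the URL and prepares a POST request without sending it. The host, port, keyspace, and column family values are taken from the admin console and demo slides. The helper names, the `Content-Type` header, and the placeholder body are assumptions made for illustration.

```python
from urllib.request import Request


def resource_url(host, port, keyspace, column_family, filename):
    """Fill in the resource URL pattern for one configuration file."""
    return f"http://{host}:{port}/solr/resource/{keyspace}.{column_family}/{filename}"


def upload_request(url, body: bytes):
    """Build (but do not send) an HTTP POST carrying a configuration file."""
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/xml; charset=utf-8"},  # assumed header
    )


url = resource_url("localhost", 8983, "wiki", "solr", "schema.xml")
req = upload_request(url, b"<schema/>")  # placeholder body, not a real schema

print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

The same two calls would be repeated for solrconfig.xml, since both files live under the same keyspace.columnfamily path.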

Solr Admin Console
http://localhost:8983/solr/wiki.solr/admin/

Rebuilding an Index
Indexes can be rebuilt
Rebuilding is useful when the schema changes or the index has become corrupted
./bin/dsetool rebuild_indexes wiki solr

Turn on Compression
Text can usually be compressed by a large factor
Turning on compression allows more data to fit in the system IO cache
UPDATE COLUMN FAMILY solr WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

General Solr

Important Ideas
Queries
Documents and Fields
Analyzers
Segments
Schema

Documents and Fields
Lucene indexes documents
Documents consist of fields
Fields consist of a name and one or more values

Analyzers
Convert text into tokens / terms
Record the position of each token
Transform tokens as configured, for example by stemming
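A toy analyzer in Python shows the three steps together. It is only a sketch: real Solr analyzers are chains of tokenizers and filters declared in schema.xml, and the `naive_stem` suffix stripper here is a made-up stand-in for a real stemmer such as Porter.

```python
import re


def naive_stem(token):
    """Toy stemmer: strip a few common English suffixes (not Porter stemming)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Split text into tokens, lowercase and stem them, and record positions.

    Returns a list of (position, term) pairs.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [(pos, naive_stem(tok.lower())) for pos, tok in enumerate(tokens)]


print(analyze("Dogs barking loudly"))
```

The recorded positions are what make phrase and proximity queries possible later, since the index can tell how far apart two terms were in the original text.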

Segments
Lucene stores the index in discrete units called segments
A merge policy determines how and when segments are merged (similar to compaction)
At query time, the segments are searched

Schema Structure
Field types are defined first: primitives, then text fields and their analyzers

Schema Type Mapping
Solr field types are mapped to native Cassandra types

Solr Type     Cassandra Type
TextField     UTF8Type
LongField     LongType
IntField      Int32Type
StringField   UTF8Type

Query Overview
Solr queries offer many of the same features as SQL (except joins)
Powerful, expressive, and fast

Query Types
Search on any number of fields with boolean logic (AND, OR, +, -)
Sort results per field, similar to SQL
Range queries
Phrase queries
Regular expression queries
Query boosting (DisMax)

Filter Queries
Cached bit sets
No score calculated
Good for queries with many results that are reused, such as types or access controls
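The cached-bit-set idea can be sketched in Python, using an integer as the bit set. This illustrates the concept only and is not Solr's filter cache. The `FilterCache` class, the `misses` counter, and the sample documents are hypothetical.

```python
class FilterCache:
    """Cache mapping a filter string to a bit set of matching document ids."""

    def __init__(self, docs):
        self.docs = docs    # list of dicts; the list index is the document id
        self.cache = {}
        self.misses = 0     # how many times a bit set had to be computed

    def bits(self, field, value):
        """Return the bit set for field:value, computing it only on first use."""
        key = f"{field}:{value}"
        if key not in self.cache:
            self.misses += 1
            mask = 0
            for doc_id, doc in enumerate(self.docs):
                if doc.get(field) == value:
                    mask |= 1 << doc_id
            self.cache[key] = mask
        return self.cache[key]


def apply_filter(hits, mask):
    """Keep only hits whose bit is set; the filter contributes no score."""
    return [doc_id for doc_id in hits if (mask >> doc_id) & 1]


docs = [{"type": "book"}, {"type": "dvd"}, {"type": "book"}]
cache = FilterCache(docs)

print(apply_filter([0, 1, 2], cache.bits("type", "book")))
print(apply_filter([1, 2], cache.bits("type", "book")))  # reuses the cached bit set
print(cache.misses)
```

The second query pays only for a bit test per hit, which is why filters suit restrictions that recur across many queries.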

Debug Queries
Pass in debug=true
Provides info about the timing of components
Debug info about the query
Debug info about the result scoring
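A request combining a main query, a filter query, a sort, and debug=true might be assembled as in the Python sketch below. The base URL reuses the wiki.solr core from the admin console slide. The `/select` handler path and the `q`, `fq`, and `sort` parameter names are standard Solr request parameters that the slides do not spell out, and the field names are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base, q, fq=(), sort=None, debug=False):
    """Build a Solr select URL from a main query and optional extras."""
    params = [("q", q)]
    params += [("fq", f) for f in fq]       # each filter query is its own fq parameter
    if sort:
        params.append(("sort", sort))       # for example "title asc"
    if debug:
        params.append(("debug", "true"))
    return f"{base}/select?{urlencode(params)}"


url = select_url(
    "http://localhost:8983/solr/wiki.solr",
    q="title:cassandra",
    fq=["type:article"],                    # hypothetical field
    sort="title asc",
    debug=True,
)

print(url)
print(parse_qs(urlsplit(url).query))        # decoded view of the parameters
```

Letting `urlencode` handle escaping avoids hand-encoding colons, spaces, and brackets in query strings.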

Sort By

Range Queries
createdate:[1999-01-01T23:59:59.999Z TO *]
field:[* TO 100]
-field:[* TO *] finds all documents without a value for field

Phrase Query
"data stax"~4
Searches for "data" and "stax" within 4 words of each other
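The "within 4 words" idea can be approximated in Python using token positions. This is a simplification: Lucene defines slop as the number of position moves needed to line the terms up, and it charges extra for terms in reversed order. The sketch ignores order and simply counts the words in between. The `within_slop` helper is hypothetical.

```python
def within_slop(tokens, first, second, slop):
    """True if `first` and `second` occur with at most `slop` words between them.

    Simplified: term order is ignored, unlike Lucene's sloppy phrase matching.
    """
    pos_a = [i for i, tok in enumerate(tokens) if tok == first]
    pos_b = [i for i, tok in enumerate(tokens) if tok == second]
    return any(abs(a - b) - 1 <= slop for a in pos_a for b in pos_b if a != b)


near = "data from the stax team".split()
far = "data a b c d e stax".split()

print(within_slop(near, "data", "stax", 4))  # two words in between
print(within_slop(far, "data", "stax", 4))   # five words in between
```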

Prefix Queries
myfield:foo*
Queries cannot begin with an asterisk

Regular Expressions
Use forward slashes to demarcate a regular expression query
Match on a five-digit zip code
body:/[0-9]{5}/
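Lucene regular expression queries are matched against whole indexed terms rather than against arbitrary substrings of the field text. The Python sketch below mimics that with `re.fullmatch` over a list of terms. The term list is invented, and Python's regex dialect differs from Lucene's in other respects, so treat this only as an illustration of the zip-code pattern.

```python
import re

ZIP = re.compile(r"[0-9]{5}")


def matching_terms(terms):
    """Return the terms that match the pattern in full, as a term-level regex would."""
    return [term for term in terms if ZIP.fullmatch(term)]


# Hypothetical terms from the body field of a few documents.
terms = ["94105", "941", "1234567", "zip"]
print(matching_terms(terms))
```

A seven-digit term is rejected here even though it contains five consecutive digits, because the whole term must match.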

Spatial Queries
Bounding box
Distance
Filtering based on distance

Auto Suggest
Uses SpellCheckComponent
Spellcheck / suggest is built from an existing index
Can be set to automatically rebuild the suggest index on commit

Prefix Auto Suggest
It is recommended to use FSTLookup or WFSTLookup
They are more memory efficient

Auto Suggest Parameters

spellcheck                   TRUE
spellcheck.dictionary        suggest
spellcheck.onlyMorePopular   TRUE
spellcheck.count             5 (number of suggestions returned)

Auto Suggest by Popular Queries
Prefix-based auto-suggest can be limiting
Use EdgeNGramFilterFactory to query within terms
Sort results by a hit count field
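The edge n-gram approach can be illustrated in Python: each word of a stored query is expanded into its leading prefixes at index time, so a typed prefix matches words anywhere in the query, and candidates are ordered by hit count. This is a conceptual sketch rather than the filter's actual implementation. The gram size parameters, the helper names, and the query counts are invented for illustration.

```python
from collections import defaultdict


def edge_ngrams(term, min_gram=1, max_gram=10):
    """Leading prefixes of a term, from min_gram up to max_gram characters."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]


def build_suggest_index(query_counts, min_gram=1, max_gram=10):
    """Map every edge n-gram of every word to the queries containing that word."""
    index = defaultdict(set)
    for query in query_counts:
        for word in query.lower().split():
            for gram in edge_ngrams(word, min_gram, max_gram):
                index[gram].add(query)
    return index


def suggest(prefix, index, query_counts, limit=5):
    """Return matching queries, most popular first (ties broken alphabetically)."""
    candidates = index.get(prefix.lower(), ())
    return sorted(candidates, key=lambda q: (-query_counts[q], q))[:limit]


# Made-up popularity data: query -> hit count.
query_counts = {"apache solr": 120, "cassandra solr": 80, "solar panels": 5}
suggest_index = build_suggest_index(query_counts)

print(suggest("sol", suggest_index, query_counts))
print(suggest("cas", suggest_index, query_counts))
```

A pure prefix suggester would only offer "solar panels" for "sol", since the other two queries start with different words.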

Dismax Query Parser
The Dismax query parser provides query-time, field-level boosting granularity, with less special syntax
Dismax is generally the best first-choice query parser for user-facing Solr applications

Facets
Intersection count of another query
Commonly seen on shopping and other web sites
Solr supports multi-select faceting
Range faceting

Facets Parameters

facet          TRUE
facet.field    fields, comma separated
facet.query    query to facet on
facet.method   enum, fc, fcs (near realtime search)
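Conceptually, a field facet counts the values of one field across the documents that matched the query, and a facet query counts how many of those matches also satisfy a second query. The Python sketch below shows both over an in-memory list. The documents, the field names, and the helper functions are hypothetical, and the sketch says nothing about how Solr computes facets internally.

```python
from collections import Counter


def facet_field(docs, matches, field):
    """Count each value of `field` among the matching documents, most common first."""
    counts = Counter()
    for doc_id in matches:
        value = docs[doc_id].get(field)
        if value is not None:
            counts[value] += 1
    return counts.most_common()


def facet_query(docs, matches, predicate):
    """Intersection count: matches that also satisfy a second query."""
    return sum(1 for doc_id in matches if predicate(docs[doc_id]))


# Hypothetical product documents; the list index is the document id.
docs = [
    {"brand": "acme", "price": 10},
    {"brand": "acme", "price": 25},
    {"brand": "zen", "price": 40},
]
matches = [0, 1, 2]  # ids returned by the main query

print(facet_field(docs, matches, "brand"))
print(facet_query(docs, matches, lambda d: d["price"] < 30))
```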

Facet Example

Group By
Much like SQL GROUP BY
Sort group values
Many options available: sort documents in a group, scroll results per group
No aggregations
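The grouping behavior can be sketched in Python: documents are bucketed by a field value, each bucket is sorted, and only the top few documents per group are returned, with no aggregates computed. The `group_results` helper, its `limit` parameter, and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def group_results(docs, field, sort_key, limit=2):
    """Group documents by `field`, sort within each group, keep `limit` per group.

    Groups are returned in sorted order of the group value.
    """
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[field]].append(doc)
    return {
        value: sorted(members, key=sort_key)[:limit]
        for value, members in sorted(groups.items())
    }


# Hypothetical documents.
docs = [
    {"author": "b", "year": 2012},
    {"author": "a", "year": 2011},
    {"author": "b", "year": 2010},
    {"author": "b", "year": 2011},
]

grouped = group_results(docs, "author", sort_key=lambda d: d["year"], limit=2)
print(grouped)
```

Unlike SQL GROUP BY, the output is still a set of documents per group rather than one aggregated row per group.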

Highlighting
Highlighting re-analyzes each document
The fast vector highlighter is faster but requires more storage

Highlighting Parameters

hl                            TRUE
hl.fl                         fields, comma separated
hl.useFastVectorHighlighter   true/false

The End
jason.rutherglen@thinkbiganalytics.com
