Using ElasticSearch to Enable Stronger Query Support in Cassandra

www.impetus.com

Introduction

Relational databases have been in use for decades, but with the advent of big data there is a need to use NoSQL databases to handle the enormous amount of accumulating data. This leads to multiple challenges:

- Many operations, such as aggregations, grouping, and ordering, cannot be performed with many NoSQL databases
- Users are unable to adopt NoSQL data stores because of the restrictions they impose on executing complex and rich queries
- The learning curve for these databases is steep

Cassandra is one such data store with restricted query support. However, search servers such as Elasticsearch enable extensive querying mechanisms over data indexes. This paper provides an overview of Cassandra and Elasticsearch.

What is Cassandra?

Cassandra is a distributed database management system that easily handles enormous amounts of data. Cassandra is built to do the following:

- Provide high availability with no single point of failure, even as the size of clusters increases
- Require almost no configuration to add new nodes to existing clusters, thanks to its elastic scalability
- Handle increasing amounts of data with minimal changes at any point in time
- Provide flexible data storage
- Allow easy data distribution
- Provide operational simplicity

However, despite these strong features and functions, there are some fundamental limitations, such as primitive querying and search capabilities. Most NoSQL databases are built around querying and updating records by primary key, but relying on primary keys alone is a highly ineffective way of using the data. In real-world scenarios, NoSQL data frequently needs to be queried on non-primary-key columns, for example to find commodities whose price is greater or less than a given value, or to list employees whose address contains "xyz" in its Street field. Cassandra lacks the querying capabilities that such scenarios require, and this shortcoming restricts users from leveraging its data storage power to the fullest.
What is Elasticsearch?

Elasticsearch is a powerful search server based on Lucene and is used for real-time indexing. It provides a distributed, multitenant-capable, full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.

It supports features such as:

- Suggestions
- Autocomplete
- Querying
- Filtering
- Post filtering
- Aggregations

Aggregation support lets users perform numerical operations on data, which proves to be an exceptionally strong add-on. Elasticsearch is the second most popular enterprise search engine.

Aggregations in Elasticsearch

Aggregation is one of the most powerful features in Elasticsearch. Think of it as a unit of work that builds analytic information over a set of documents indexed in Elasticsearch. Used alongside a NoSQL database, this analytic processing can supply SQL-style operations such as sum, min, max, count, order by, group by, having, and many more.

How does Elasticsearch support stronger queries in Cassandra?

One of the most effective approaches we have found is to use an indexing store in combination with a NoSQL database. Indexing stores can perform several operations directly on secondary indexes. The result of that process consists of IDs that can be used to retrieve the corresponding NoSQL data. In this way, the desired output can be achieved. This process is described in the diagram below:

[Diagram: data aggregation flow between the client, the index store, and the NoSQL datastore]

- Data aggregation request from client: an ES query is generated by parsing the request
- Index store: aggregation is performed over secondary indexes by executing the generated ES query, which returns the required record IDs
- A NoSQL datastore-specific query is generated and the aggregation values are passed
- NoSQL datastore: the native query is executed for the record IDs returned by ES, and the records are returned to the client in the response

Here's how it works:

1. The client sends a request
2. This request is passed to the indexing store
3. The request is then sent to Cassandra
4. Then the indexing store performs the aggregation over the secondary indexes stored in it
5. Elasticsearch performs analytics over the indexes as required by the query, then returns the aggregated values and the respective record IDs
6. These record IDs are used to send native queries to Cassandra to fetch the complete records

You may be concerned that in order to use the above approach you will have to learn NoSQL databases as well as the indexing store's mechanism, but the good news is that there is a tool that allows you to achieve all of the above simply by creating SQL-like queries. As a result, you can start using a NoSQL database along with the indexing store without learning any native database API. This tool is called Kundera.

Kundera can take a JPA query (JPA is a popular persistence API) as input and provide the desired output using the native APIs of the data stores and the index stores that it supports under the hood. The diagram below describes how one can create indexes and run complex queries on data in NoSQL using Kundera in the JPA way.

[Diagram: Kundera query flow. The client sends a JPA query to Kundera (core engine, client delegator, response wrapper), which works with the index store and the NoSQL datastore and returns the response to the client]

- Core engine: generates the corresponding index store query and delegates to the index store
- Index store: for an aggregated query, the query response contains the aggregation results computed directly over the secondary indexes; otherwise it returns the IDs of the records that match the query
- For a non-aggregation query, the index store returns the IDs of the records that match the criteria, in the required order. These IDs are then passed to the client delegator
- Client delegator: generates the client-specific query for the respective IDs and executes it
- NoSQL datastore: returns the queried records, which are then passed to the response wrapper

Sample contents of the index store shown in the diagram:

    ID     Category   Price
    128    X          1020
    129    Y          1209
    130    X          1000
    145    Z          2000
    210    X          9000
    241    B          1200
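The index-then-fetch flow described above can be illustrated with a small, self-contained Java sketch. It is not Kundera or Elasticsearch code: the class name, the method names, and the in-memory collections standing in for the index store and for Cassandra are all assumptions made for illustration. Only the sample index rows are taken from the diagram.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;
import java.util.stream.Collectors;

// Hypothetical sketch: plain collections stand in for the index store and the NoSQL store.
final class IndexThenFetchSketch {

    // One row of the secondary index: the record ID plus the indexed columns.
    static final class IndexRow {
        final int id;
        final String category;
        final int price;

        IndexRow(int id, String category, int price) {
            this.id = id;
            this.category = category;
            this.price = price;
        }
    }

    // Stand-in for the index store, filled with the sample rows from the diagram.
    static final List<IndexRow> INDEX = Arrays.asList(
            new IndexRow(128, "X", 1020),
            new IndexRow(129, "Y", 1209),
            new IndexRow(130, "X", 1000),
            new IndexRow(145, "Z", 2000),
            new IndexRow(210, "X", 9000),
            new IndexRow(241, "B", 1200));

    // Stand-in for the NoSQL data store: full records addressable only by primary key.
    static final Map<Integer, String> STORE = new HashMap<>();
    static {
        for (IndexRow row : INDEX) {
            STORE.put(row.id, "Product#" + row.id);
        }
    }

    // Aggregated query: answered from the index alone, the data store is never touched.
    static OptionalInt minPrice(String category) {
        return INDEX.stream()
                .filter(row -> row.category.equals(category))
                .mapToInt(row -> row.price)
                .min();
    }

    // Non-aggregated query, phase 1: the index returns only the matching record IDs.
    static List<Integer> idsWithPriceAbove(int threshold) {
        return INDEX.stream()
                .filter(row -> row.price > threshold)
                .map(row -> row.id)
                .collect(Collectors.toList());
    }

    // Non-aggregated query, phase 2: the IDs drive primary-key lookups in the data store.
    static List<String> fetchByIds(List<Integer> ids) {
        return ids.stream().map(STORE::get).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("min price in category X: " + minPrice("X").getAsInt());
        System.out.println("records with price > 1200: " + fetchByIds(idsWithPriceAbove(1200)));
    }
}
```

The point of the split is that the data store is only ever asked for records by primary key, which is the access pattern it handles well, while every filter or aggregate on a non-key column is resolved against the index.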

Here's what's happening in the previous diagram:

1. The client submits a JPA query
2. Kundera's core engine analyzes the query
3. The query is then passed to the Elasticsearch client
4. The client generates an equivalent Elasticsearch aggregation
5. Elasticsearch processes the query over the secondary index data stored in it
6. Elasticsearch then sends the processed response back to the core engine
7. The core engine further analyzes this response
8. If the JPA query only needs the aggregation result, this response is redirected to the response wrapper. Otherwise, it is redirected to the respective NoSQL client delegator
9. The client delegator generates the client-specific query for the row IDs
10. These row IDs are the ones Elasticsearch has filtered according to the query criteria
11. The query is then passed to the NoSQL data store
12. The data store fetches the records and passes them to the response wrapper
13. The response wrapper prepares the results in the required manner and returns them to the client

Example

To demonstrate this with a query, let's take the example of a table Product with price as one of its columns. Let's say we need to find the minimum price. In SQL, this can be achieved using the following query:

    Select min(price) from Product

However, in Cassandra there is no direct query support for finding the minimum of a column's values. To find it, we would have to fetch all the values and process them on the client to find the minimum, which is a very inefficient approach. The cost is directly proportional to the number of values in the column.
Using Kundera, you can achieve this simply by doing the following:

Create an entity corresponding to the Product table and specify which columns are to be indexed in the following manner:

    @IndexCollection(columns = {
        @com.impetus.kundera.index.Index(name = "price"),
        @com.impetus.kundera.index.Index(name = "productcategory") })
    public class Product { }

Specify the indexer in Cassandra's persistence unit:

    <persistence-unit name="esindexertest">
      <provider>com.impetus.kundera.KunderaPersistence</provider>
      <properties>
        <property name="kundera.nodes" value="localhost" />
        <property name="kundera.port" value="9160" />
        <property name="kundera.keyspace" value="kunderaexamples" />
        <property name="kundera.dialect" value="cassandra" />
        <property name="kundera.ddl.auto.prepare" value="create" />
        <property name="kundera.client.lookup.class"
                  value="com.impetus.client.cassandra.thrift.ThriftClientFactory" />
        <property name="kundera.indexer.class"
                  value="com.impetus.client.es.index.ESIndexer" />
      </properties>
    </persistence-unit>

Here, setting the property "kundera.indexer.class" to "com.impetus.client.es.index.ESIndexer" tells Kundera's Cassandra client to create indexes using Elasticsearch and to run queries against those indexes. In the background, the indexes are generated in Elasticsearch as data is persisted through Kundera. You can now use the Min aggregation of Elasticsearch to find the minimum value over the indexed data. Let's see how this happens.

The Min aggregation returns the minimum value among the numeric values extracted from the aggregated indexed documents. Below is an example of a Min aggregation query:

    {
      "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
      }
    }

This aggregation returns the minimum value in the price column. The maximum value can be found in a similar way. A similar JPA query can be run using Kundera to find the minimum and maximum price within a product category:

    Select min(e.price), max(e.price) from Product e where e.productcategory = 'Category1'

When a JPA query is received, it is parsed and analyzed by Kundera's core engine. For each aggregation keyword found, Kundera generates an Elasticsearch query. This query is delegated to Elasticsearch, and the response is passed to the response wrapper as shown in the diagram. Below is the corresponding Elasticsearch aggregation query:

    {
      "aggregations" : {
        "whereclause" : {
          "filter" : {
            "term" : { "productcategory" : "Category1" }
          },
          "aggregations" : {
            "MIN_price" : { "min" : { "field" : "price" } },
            "MAX_price" : { "max" : { "field" : "price" } }
          }
        }
      }
    }

In a similar manner, sum, average, and count queries can be executed over Cassandra data through Elasticsearch.

Here's another example. For the same Product table, suppose there is a column Product Category, and we need to find records grouped by product category. In SQL, we can achieve this using the following query:

    Select * from Product where price > 100 group by productcategory

Although there is no query in Cassandra that directly supports group by, this can be achieved using Kundera's Cassandra + Elasticsearch combination. Kundera translates the above query into Elasticsearch's terms aggregation, which groups the records on the basis of the field value. The following JPA query fetches the records grouped by product category:

    Select p from Product p where p.price > 100 group by p.productcategory

Equivalent terms aggregation query:

    {
      "aggregations" : {
        "whereclause" : {
          "filter" : {
            "range" : {
              "price" : { "from" : "100", "to" : null }
            }
          },
          "aggregations" : {
            "group_by" : {
              "terms" : { "field" : "productcategory" }
            }
          }
        }
      }
    }

Here's what happens:

1. The Elasticsearch query fetches the records grouped according to the category field
2. Elasticsearch returns the IDs of the records that have a price greater than 100, grouped by the Product Category column value
3. These record IDs are then passed to the client delegator
4. The client delegator generates a Cassandra-specific query and executes it to fetch the complete records from Cassandra

In this way we can get records grouped by column values using Elasticsearch.

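The translation step in the examples above, from a parsed JPA query to an Elasticsearch request body, can be sketched as plain string building. This is only an illustration under stated assumptions: the paper does not show how Kundera builds its requests, so the class and method names here are hypothetical, and the output mirrors only the JSON shapes used in the examples ("whereclause" filter wrapper, "MIN_price"-style metric names, and a "group_by" bucket using Elasticsearch's terms aggregation). Field names and values are inserted without JSON escaping, which is acceptable only for a sketch.

```java
import java.util.Locale;

// Hypothetical sketch of turning parsed query parts into an aggregation request body.
final class AggregationQuerySketch {

    // Equality condition from the where clause, e.g. productcategory = 'Category1'.
    static String termFilter(String field, String value) {
        return "{\"term\":{\"" + field + "\":\"" + value + "\"}}";
    }

    // Lower-bound range condition, e.g. price > 100, in the from/to form used above.
    static String rangeFrom(String field, String from) {
        return "{\"range\":{\"" + field + "\":{\"from\":\"" + from + "\",\"to\":null}}}";
    }

    // One metric aggregation, e.g. min(price) becomes "MIN_price":{"min":{"field":"price"}}.
    static String metric(String function, String field) {
        String fn = function.toLowerCase(Locale.ROOT);
        return "\"" + fn.toUpperCase(Locale.ROOT) + "_" + field + "\":{\""
                + fn + "\":{\"field\":\"" + field + "\"}}";
    }

    // Group-by clause expressed as a terms bucket aggregation on the given field.
    static String groupBy(String field) {
        return "\"group_by\":{\"terms\":{\"field\":\"" + field + "\"}}";
    }

    // Wraps the filter and its nested aggregations in the "whereclause" envelope.
    static String filtered(String filterJson, String... subAggregations) {
        return "{\"aggregations\":{\"whereclause\":{\"filter\":" + filterJson
                + ",\"aggregations\":{" + String.join(",", subAggregations) + "}}}}";
    }

    public static void main(String[] args) {
        // Select min(e.price), max(e.price) from Product e where e.productcategory = 'Category1'
        System.out.println(filtered(termFilter("productcategory", "Category1"),
                metric("min", "price"), metric("max", "price")));
        // Select p from Product p where p.price > 100 group by p.productcategory
        System.out.println(filtered(rangeFrom("price", "100"), groupBy("productcategory")));
    }
}
```

Keeping the where clause as a filter and nesting the metrics or buckets beneath it means the aggregation only ever sees documents that satisfy the condition, which is what the SQL-style semantics require.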
Using the above approach, we can perform aggregations and other complex queries that are not directly available in Cassandra. Elasticsearch offers many other rich aggregations whose strength can be leveraged to query NoSQL data.

To summarize, selecting the right indexing mechanism, creating the required indexes, and using them in combination with a NoSQL database can enrich the querying support of many NoSQL data stores.

About Impetus

Impetus is focused on creating big business impact through Big Data Solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and on-going support to its clients. Visit http://bigdata.impetus.com or write to us at bigdata@impetus.com

© 2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies. June 2015