Using Elasticsearch to Enable Stronger Query Support in Cassandra
www.impetus.com
Introduction

Relational databases have been in use for decades, but with the advent of big data there is a need for NoSQL databases that can handle the enormous amount of accumulating data. This leads to several challenges:

- Many operations, such as aggregation, grouping, and ordering, cannot be performed with many NoSQL databases
- Users are unable to adopt NoSQL data stores because of the restrictions they impose on executing complex and rich queries
- The learning curve for these databases is steep

Cassandra is one such data store with restricted query support. However, search servers such as Elasticsearch enable extensive querying over indexed data. This paper provides an overview of Cassandra and Elasticsearch.

What is Cassandra?

Cassandra is a distributed database management system that handles enormous amounts of data. It is built to:

- Provide high availability with no single point of failure, even as the size of the cluster increases
- Require almost no configuration to add new nodes to an existing cluster, thanks to its elastic scalability
- Handle increasing amounts of data with minimal changes at any point in time
- Provide flexible data storage
- Allow easy data distribution
- Provide operational simplicity

However, despite these strengths, Cassandra has some fundamental limitations, such as primitive querying and search capabilities. Most NoSQL databases are built around querying and updating records by primary key, but relying on primary keys alone is a highly restrictive way of using them. Real-world applications frequently need to query on non-primary-key columns, for example to find commodities whose price is greater or less than a given value, or to list employees whose address has xyz in its Street field. Cassandra lacks the querying capabilities that such scenarios need, and this shortcoming keeps users from leveraging its data storage power to the fullest.
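To make the limitation concrete, here is a minimal Java sketch. It is not Cassandra code: a map stands in for a store that is addressed by primary key, and the Commodity class, the sample rows, and the method names are all assumptions made for illustration. A lookup by key is a single map access, whereas a condition on a non-key field such as price forces a visit to every record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyLookupSketch {

    // Hypothetical record type for illustration; not part of any library.
    static final class Commodity {
        final int id;
        final String name;
        final int price;

        Commodity(int id, String name, int price) {
            this.id = id;
            this.name = name;
            this.price = price;
        }
    }

    // Stand-in for a key-addressed store: primary key -> record.
    static final Map<Integer, Commodity> STORE = new LinkedHashMap<>();

    static {
        // Made-up sample rows.
        for (Commodity c : new Commodity[] {
                new Commodity(1, "copper", 1020),
                new Commodity(2, "zinc", 1209),
                new Commodity(3, "tin", 1000) }) {
            STORE.put(c.id, c);
        }
    }

    // Primary-key access: one lookup, independent of how many records exist.
    static Commodity findByKey(int id) {
        return STORE.get(id);
    }

    // Non-key condition: without a secondary index, every record is inspected.
    static List<Integer> idsWithPriceAbove(int threshold) {
        List<Integer> ids = new ArrayList<>();
        for (Commodity c : STORE.values()) {
            if (c.price > threshold) {
                ids.add(c.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(findByKey(2).name);
        System.out.println(idsWithPriceAbove(1100));
    }
}
```

The second method is the kind of work that an external index can take over, which is the idea the rest of this paper builds on.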
What is Elasticsearch?

Elasticsearch is a powerful search server based on Lucene and is used for real-time indexing. It provides a distributed, multitenant-capable, full-text search engine with a RESTful web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
It supports features such as:

- Suggestions
- Autocomplete
- Querying
- Filtering
- Post filtering
- Aggregations

Aggregation support lets users perform numerical operations on data, which proves to be an exceptionally strong add-on. Elasticsearch is the second most popular enterprise search engine.

Aggregations in Elasticsearch

Aggregation is one of the most powerful features in Elasticsearch. Think of it as a unit of work that builds analytic information over a set of documents that are indexed in Elasticsearch. Combined with a NoSQL database, this analytic processing can be used to perform SQL-style operations (such as sum, min, max, count, order by, group by, having, and many more).

How does Elasticsearch support stronger queries in Cassandra?

One of the most effective approaches we have found is to use an indexing store in combination with a NoSQL database. The indexing store can perform several operations directly on secondary indexes. The result of that process consists of IDs that can be used to retrieve the corresponding NoSQL data. In this way, the desired output can be achieved. This process is described in the diagram below:

[Figure: data aggregation flow between the client, the index store, and the NoSQL datastore. Labels in the diagram:]
- Data aggregation request from the client
- An ES query is generated by parsing the request
- Aggregation is performed over secondary indexes by executing the generated ES query, which returns the required record IDs
- A NoSQL datastore-specific query is generated and the aggregation values are passed
- The NoSQL native query is executed for the record IDs returned by ES, and the records are returned to the client
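The two-step lookup described above can be sketched in plain Java. This is only an illustration under stated assumptions: a HashMap stands in for the NoSQL datastore, a TreeMap stands in for the index store's secondary index on price, and the class name, method names, and sample rows are hypothetical. Nothing here calls Cassandra or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TwoStepLookupSketch {

    // Stand-in for the NoSQL datastore: record ID -> full record.
    static final Map<Integer, String> DATASTORE = new HashMap<>();

    // Stand-in for the index store: secondary index from price to record IDs.
    static final TreeMap<Integer, List<Integer>> PRICE_INDEX = new TreeMap<>();

    // Writing a record updates both the datastore and the secondary index.
    static void persist(int id, String category, int price) {
        DATASTORE.put(id, "Product{id=" + id + ", category=" + category + ", price=" + price + "}");
        PRICE_INDEX.computeIfAbsent(price, p -> new ArrayList<>()).add(id);
    }

    static {
        // Illustrative sample rows.
        persist(128, "X", 1020);
        persist(129, "Y", 1209);
        persist(130, "X", 1000);
        persist(145, "Z", 2000);
        persist(210, "X", 9000);
        persist(241, "B", 1200);
    }

    // Step 1: the index answers the non-key condition and returns only IDs.
    static List<Integer> idsWithPriceGreaterThan(int threshold) {
        List<Integer> ids = new ArrayList<>();
        for (List<Integer> bucket : PRICE_INDEX.tailMap(threshold, false).values()) {
            ids.addAll(bucket);
        }
        return ids;
    }

    // Step 2: the datastore is read by primary key for each returned ID.
    static List<String> fetchByIds(List<Integer> ids) {
        List<String> records = new ArrayList<>();
        for (int id : ids) {
            records.add(DATASTORE.get(id));
        }
        return records;
    }

    public static void main(String[] args) {
        for (String record : fetchByIds(idsWithPriceGreaterThan(1500))) {
            System.out.println(record);
        }
    }
}
```

The design point is that the datastore is only ever asked for records by ID, which is the access pattern it handles well; the condition on price is resolved entirely in the index.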
Here's how it works:

1. The client sends a request
2. This request is passed to the indexing store
3. The request is then sent to Cassandra
4. The indexing store then performs the aggregation over the secondary indexes stored in it
5. Elasticsearch performs analytics over the indexes as required by the query, then returns the aggregated values and the respective record IDs
6. These record IDs are used to send native queries to Cassandra to fetch the complete records

You may be concerned that, in order to use the above approach, you will have to learn the NoSQL database as well as the indexing store's mechanism. The good news is that there is a tool that allows you to achieve all of the above simply by writing SQL-like queries. As a result, you can start using a NoSQL database along with the indexing store without learning any native database's API. This tool is called Kundera. Kundera takes a JPA query (JPA being a popular API) as input and produces the desired output using the native APIs of the data stores and the index stores that it supports under the hood. The diagram below describes how one can create indexes and run complex queries on data in NoSQL using Kundera in the JPA way.

[Figure: query flow through Kundera. Components: Client, Core Engine, Index Store, Client Delegator, NoSQL Datastore, Response Wrapper. Labels in the diagram:]
- The client submits a JPA query
- The core engine generates the corresponding index store query and delegates to it
- The query response contains the aggregation results computed directly over secondary indexes in the case of an aggregated query; otherwise it contains the IDs of the records that match the query
- For an aggregated query, the response goes to the response wrapper
- For a non-aggregation query, the index store returns the IDs of the records that match the criteria, in the required order. These IDs are then passed to the client delegator
- The client delegator generates a client-specific query for the respective IDs and executes it
- The NoSQL DB returns the queried records, which are then passed to the response wrapper

Sample index store contents shown in the diagram:

ID     Category   Price
128    X          1020
129    Y          1209
130    X          1000
145    Z          2000
210    X          9000
241    B          1200
Here's what's happening in the previous diagram:

1. The client submits a JPA query
2. Kundera's core engine analyzes the query
3. The query is then passed to the Elasticsearch client
4. The client generates an equivalent Elasticsearch aggregation
5. Elasticsearch processes the query over the secondary index data stored in it
6. Elasticsearch then sends the processed response back to the core engine
7. The core engine further analyzes this response
8. If the JPA query only needs the aggregation result, this response is redirected to the response wrapper. Otherwise, it is redirected to the respective NoSQL client delegator
9. The client delegator generates the client-specific query for the row IDs
10. These row IDs are the ones filtered by Elasticsearch according to the query criteria
11. The query is then passed to the NoSQL data store
12. The data store fetches the records and passes them to the response wrapper
13. The response wrapper prepares the results in the required manner and returns them to the client

Example

To demonstrate this with a query, take the example of a table named Product that has price as one of its columns. Let's say we need to find the minimum price. In SQL, this can be achieved using the following query:

    Select min(price) from Product

However, in Cassandra there is no direct query support for finding the minimum of a column's values. To find it, we would have to fetch all the values and process them to find the minimum, which is a very inefficient approach. The cost is directly proportional to the number of values in the column.
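As a rough illustration of the cost described above, the sketch below computes the minimum on the client by visiting every fetched price, and contrasts it with reading the smallest entry of a sorted structure that is maintained at write time. The class name, method names, and sample prices are assumptions for illustration. The TreeSet is only a stand-in for "an index exists"; it is not a description of how Elasticsearch stores or aggregates data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class MinPriceSketch {

    // Client-side minimum: every fetched value is visited once,
    // so the work grows linearly with the number of values.
    static int minByFullScan(List<Integer> fetchedPrices) {
        if (fetchedPrices.isEmpty()) {
            throw new IllegalArgumentException("no prices fetched");
        }
        int min = fetchedPrices.get(0);
        for (int price : fetchedPrices) {
            if (price < min) {
                min = price;
            }
        }
        return min;
    }

    // With a sorted index kept up to date on every write,
    // the minimum is simply the first entry.
    static int minFromSortedIndex(TreeSet<Integer> priceIndex) {
        return priceIndex.first();
    }

    public static void main(String[] args) {
        // Illustrative sample prices.
        List<Integer> prices = Arrays.asList(1020, 1209, 1000, 2000, 9000, 1200);
        System.out.println(minByFullScan(prices));
        System.out.println(minFromSortedIndex(new TreeSet<>(prices)));
    }
}
```

Both methods return the same value; the difference is where the work happens and how much data has to leave the data store to do it.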
Using Kundera, you can achieve this simply by doing the following:

Create an entity corresponding to the Product table and specify which columns are to be indexed in the following manner:

    @IndexCollection(columns = {
        @com.impetus.kundera.index.Index(name = "price"),
        @com.impetus.kundera.index.Index(name = "productcategory") })
    public class Product { }

Specify the indexer in Cassandra's persistence unit:

    <persistence-unit name="esindexertest">
        <provider>com.impetus.kundera.KunderaPersistence</provider>
        <properties>
            <property name="kundera.nodes" value="localhost" />
            <property name="kundera.port" value="9160" />
            <property name="kundera.keyspace" value="kunderaexamples" />
            <property name="kundera.dialect" value="cassandra" />
            <property name="kundera.ddl.auto.prepare" value="create" />
            <property name="kundera.client.lookup.class" value="com.impetus.client.cassandra.thrift.ThriftClientFactory" />
            <property name="kundera.indexer.class" value="com.impetus.client.es.index.ESIndexer" />
        </properties>
    </persistence-unit>

Here, setting the property "kundera.indexer.class" to "com.impetus.client.es.index.ESIndexer" tells Kundera's Cassandra client to create indexes using Elasticsearch and to run queries against those indexes in Elasticsearch. In the background, the indexes are generated in Elasticsearch whenever data is persisted through Kundera. You can now use the Min aggregation of Elasticsearch to find the minimum value over the indexed data. Let's see how this happens.

The Min aggregation returns the minimum of the numeric values extracted from the aggregated indexed documents. Below is an example of a Min aggregation query:

    {
      "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
      }
    }

This aggregation returns the minimum value in the price column. The maximum value can be found in a similar way. A JPA query such as the following can be run through Kundera to find the minimum and maximum price within a product category:

    Select min(e.price), max(e.price) from Product e where e.productcategory = 'Category1'

When a JPA query is received, Kundera's core engine parses and analyzes it. For any aggregation keyword it finds, Kundera's Elasticsearch client generates an equivalent Elasticsearch query. This query is delegated to Elasticsearch, and the response is passed to the response wrapper as shown in the diagram. Below is the corresponding Elasticsearch aggregation query:

    {
      "aggregations" : {
        "whereclause" : {
          "filter" : { "term" : { "productcategory" : "Category1" } },
          "aggregations" : {
            "MIN_price" : { "min" : { "field" : "price" } },
            "MAX_price" : { "max" : { "field" : "price" } }
          }
        }
      }
    }

In a similar manner, Elasticsearch can execute sum, average, and count queries over data stored in Cassandra.

Here's another example. For the same Product table, suppose there is a column Product Category, and we need to find records grouped by product category. In SQL, we can achieve this using the following query:

    Select * from Product where price > 100 group by productcategory

Although no query in Cassandra directly supports group by, this can be achieved using Kundera's Cassandra + Elasticsearch combination. Kundera translates the above query into Elasticsearch's Terms aggregation, which groups the records on the basis of a field's value. The following JPA query fetches the records grouped by product category:

    Select p from Product p where p.price > 100 group by p.productcategory

Equivalent Terms aggregation query:

    {
      "aggregations" : {
        "whereclause" : {
          "filter" : { "range" : { "price" : { "from" : "100", "to" : null } } },
          "aggregations" : {
            "group_by" : { "terms" : { "field" : "productcategory" } }
          }
        }
      }
    }

Here's what happens:

1. The Elasticsearch query fetches the records grouped according to the category field
2. Elasticsearch returns the IDs of the records that have a price greater than 100, grouped by the Product Category column value
3. These record IDs are then passed to the client delegator
4. The client delegator generates a Cassandra-specific query and executes it to fetch the complete records from Cassandra

In this way, we can get records grouped by a column's values using Elasticsearch.
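The filter-then-group step can be sketched in plain Java to show the shape of the result that the index store hands back. This is an illustration only: the IndexedRow class, the method name, and the in-memory list are hypothetical stand-ins for the secondary index, the sample rows are illustrative, and nothing here calls Kundera or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupBySketch {

    // Hypothetical view of one indexed entry: record ID plus the indexed columns.
    static final class IndexedRow {
        final int id;
        final String category;
        final int price;

        IndexedRow(int id, String category, int price) {
            this.id = id;
            this.category = category;
            this.price = price;
        }
    }

    // Stand-in for the secondary index held by the index store.
    static final List<IndexedRow> INDEX = Arrays.asList(
            new IndexedRow(128, "X", 1020),
            new IndexedRow(129, "Y", 1209),
            new IndexedRow(130, "X", 1000),
            new IndexedRow(145, "Z", 2000),
            new IndexedRow(210, "X", 9000),
            new IndexedRow(241, "B", 1200));

    // Keeps rows with price strictly above the bound, then buckets their IDs
    // by category. A TreeMap is used so the groups come back in sorted order.
    static Map<String, List<Integer>> idsGroupedByCategory(int exclusiveLowerBound) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (IndexedRow row : INDEX) {
            if (row.price > exclusiveLowerBound) {
                groups.computeIfAbsent(row.category, c -> new ArrayList<>()).add(row.id);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Corresponds to: where price > 100 group by productcategory
        System.out.println(idsGroupedByCategory(100));
    }
}
```

Only IDs leave this step. Fetching the full records for each group is then a matter of primary-key reads against the data store, which is the part the client delegator handles.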
Using the above approach, we can perform aggregations and other complex queries that are not directly available in Cassandra. Elasticsearch offers many other rich aggregations whose strength can be leveraged to query NoSQL data. To summarize, selecting the right indexing mechanism, creating the required indexes, and using them in combination with a NoSQL database can enrich the querying support available for many NoSQL data stores.

About Impetus

Impetus is focused on creating big business impact through Big Data Solutions for Fortune 1000 enterprises across multiple verticals. The company brings together a unique mix of software products, consulting services, Data Science capabilities and technology expertise. It offers full life-cycle services for Big Data implementations and real-time streaming analytics, including technology strategy, solution architecture, proof of concept, production implementation and on-going support to its clients. Visit http://bigdata.impetus.com or write to us at bigdata@impetus.com

June 2015
2015 Impetus Technologies, Inc. All rights reserved. Product and company names mentioned herein may be trademarks of their respective companies.