
Technical Sheet: NITRODB Time-Series Database
10X Performance, 1/10th the Cost

INTRODUCTION

NITRODB is an Apache Spark based time-series database built to store and analyze hundreds of terabytes of time-series data at lightning-fast speeds with an extremely low hardware requirement. It has been designed with a single goal in mind: deliver 10X query performance at 1/10th the cost. Some of the key features that enable this are:

1. Intelligent Data Storage
Data is stored on disk as Parquet, the columnar data format for Hadoop. This format avoids reading columns that are not needed by the query, in contrast to conventional row-oriented databases, which read every column of a row and thus incur high disk I/O. Moreover, Parquet provides a number of compression and encoding techniques which not only reduce the overall storage footprint but also dramatically improve query performance by reducing CPU, memory, and disk I/O requirements at processing time. Original data size is generally reduced by 90% with zero data loss, even with high-availability redundancy turned on. (A short sketch of this column pruning appears after this list.)

2. Intelligent Data Processing
The system generates an intelligent mix of specific materialized views (for frequently accessed data) as well as primary indexes (BTree and Bloom-filter based) on the ingested raw data and stores them as segments distributed across the cluster. Only the index segments and views that are required by a query are targeted, reducing the volume of data that needs to be read and processed. Since fewer CPU cores are required to process each query, the unused computing resources are made available to other queries so that they can run in parallel.
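To make the column pruning in (1) concrete, here is a minimal PySpark sketch. The paths and column names are illustrative assumptions, not part of NITRODB's API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-pruning-demo").getOrCreate()

    # Ingest a row-oriented source and persist it as compressed Parquet.
    events = spark.read.json("s3a://demo-bucket/raw/events/")  # hypothetical path
    (events.write
           .option("compression", "snappy")     # encoded and compressed column chunks
           .parquet("s3a://demo-bucket/warehouse/events_parquet/"))

    # A query touching two columns reads only those column chunks from disk;
    # the filter is additionally pushed down to skip non-matching row groups.
    df = spark.read.parquet("s3a://demo-bucket/warehouse/events_parquet/")
    (df.where(df.event_time >= "2015-01-01")
       .select("device_id", "temperature")
       .show())

A row-oriented store would have to read every column of every matching row; here the scan is limited to the two referenced columns.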
FEATURES

The system also includes several other features designed to offer performance, scalability, reliability, and ease of use. These include:

1. Local Caching
Repeatedly issuing I/O requests for the same few popular columns, indexes, and metadata files is time consuming. To prevent this, frequently accessed column and index blocks, as well as small metadata files, are automatically cached. This reduces the number of round-trips to the HDFS NameNode where the data resides, resulting in superior performance. Reporting which used to take hours is now ready in minutes.

2. Adaptive Query and Index Cache
When working with a BI tool, it is very common to have many repeating (or similar) queries generated by dashboards and pre-built reports. Also, many ad-hoc explorations, while eventually diving deep into a unique subset of the data, typically start by progressively adding filters on top of the logical data model. Our engine takes advantage of this predictable BI user behavior by automatically caching result sets, both intermediate and near-final, and transparently using them when applicable.

The adaptive cache is also incremental. For example, assume that query results were cached and new rows (containing information used by the query) are added later. If the query is fired again, in most cases the engine will still leverage the cached results. It will compute the query results only on the delta and merge them with the cached results to produce the final query output. It will then replace the old cached results with the new ones. This mechanism is transparent to the user, who will never see stale or inconsistent data. (A toy sketch of this delta-and-merge behavior appears at the end of this section.)

3. Scale-Out Architecture
As you add more concurrent BI users or as your query load starts to increase, you can easily start additional nodes to support them, and remove them when they are no longer needed.

4. Fully Fault Tolerant
The system uses Mesos for ensuring fault tolerance. A single "hot" master runs in a cluster with some number of standby nodes that are monitored by Zookeeper. Zookeeper monitors all the nodes in the master cluster and manages the election of a new master when the hot master node fails. Data replication is handled by HDFS.

5. Easy and Flexible Deployment on a Cluster
We support one-click deployment on AWS, GCE, and Azure via our custom cloud management tool, akin to a custom "recipe" on Chef.

SIGVIEW: REAL-TIME BUSINESS INTELLIGENCE

SIGVIEW is Sigmoid's fully managed, full-stack Business Intelligence solution that combines a revolutionary in-memory columnar database, an analytics engine, and an interactive visual frontend. It delivers a full range of capabilities: ingest over a million events per second, ad-hoc queries on petabytes of data with response times in seconds, interactive dashboards, notifications and alerts, and enterprise reporting.

SIGVIEW has been designed for ad-hoc exploratory analytics at scale, so it is able to respond to your analytics queries in less than a few seconds. Reporting which used to take hours is now ready in minutes. SIGVIEW is based on Apache Spark and easily integrates with an organization's existing IT infrastructure at the lowest TCO and highest ROI.

KEY FEATURES & BENEFITS

Columnar Storage: We use innovative data encoding and storage technology to compress data several fold. Vertically partitioning the data means we read only the data that is required, reducing overhead and improving performance.

Enriched Insights for Everyone: SIGVIEW provides complete and enriched insights to all employees across all levels in an organization, not just analysts. Information is optimized for a user's role by leveraging a unified BI foundation that can ingest data from different data sources. It can source both structured and unstructured data and store it in a single place without cubing or pre-aggregation, so that business users can access information in the most intuitive logical manner through interactive dashboards, ad-hoc queries, reports, and API calls.

Fully Scale-Out Architecture: As your data volume increases, we simply add more machines on the fly to handle the increased volume. Moreover, you are not stuck with one monolithic machine or appliance; our platform can horizontally scale to 1000s of nodes.

Indexing & In-Memory Projections: We use statistical indexes like Bloom filters and in-memory projections of the data to significantly speed up data access and reduce query processing time by 10x over other warehousing solutions.

Optimized for Time-Series Data: The SIGVIEW platform is optimized to store and analyze any time-series data. Hence it is not restricted to a particular industry but can be used across retail, banking, logistics, advertising, etc.
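Returning to the adaptive query cache described under FEATURES above, here is a toy Python sketch of the delta-and-merge idea for an additive aggregate (SUM). All names are invented for illustration; this is not NITRODB source code:

    cache = {}  # query fingerprint -> {group key: running sum}

    def run_sum_query(fingerprint, delta_rows):
        """Aggregate only the new (delta) rows, then merge with cached results."""
        partial = {}
        for key, value in delta_rows:              # aggregate the delta only
            partial[key] = partial.get(key, 0) + value
        merged = dict(cache.get(fingerprint, {}))  # start from the cached result
        for key, value in partial.items():         # merge delta into cached result
            merged[key] = merged.get(key, 0) + value
        cache[fingerprint] = merged                # replace old cached result
        return merged

    run_sum_query("q1", [("US", 10), ("DE", 5)])   # first run: all data is the delta
    print(run_sum_query("q1", [("US", 3)]))        # later run touches only new rows
    # {'US': 13, 'DE': 5}

The second invocation reads only the delta rows yet returns the same answer a full re-scan would, which is the property the feature relies on.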

ARCHITECTURE

The system broadly consists of four major components, which are discussed in detail below:
1. Data Ingestion
2. Data Preparation
3. Data Manager
4. Query Engine

DATA INGESTION

Connectors to various standard data sources such as Amazon S3, Apache Kafka, HDFS, transactional databases, CRM systems, and REST APIs can be used to ingest data into the system at user-defined, configurable periodic intervals (as low as 1 minute). All standard file formats for ingestion, i.e. CSV, JSON, XML, Parquet, etc., are natively supported.

DATA PREPARATION

Data ingested into the system is converted to Parquet by the Spark-based Wrangler module. This module is also responsible for running data validations and keeping the schema of the data uniform throughout the system. Its pluggable architecture also allows complex user-specific workflow logic (entire ETL pipelines, for example) to be implemented. (A minimal sketch of such a preparation step appears at the end of this section.)

DATA PREPARATION PHASES

Internally, an indexer module stores data in a form that ensures each projection of the data represents one or more attributes of a logical table. This is completely transparent to the end user, who will only see the underlying data as a single logical entity. This allows the system to access only the data actually required by the query. It also ensures that data is distributed across several projections, which further improves performance. If certain columns are always read together, e.g. PUBLISHER, COUNTRY, REVENUE, then these can be grouped together so that they are retrieved in a single I/O.
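As a rough illustration of what such a preparation step looks like in Spark, here is a minimal PySpark sketch. The schema, paths, and validation rule are assumptions for illustration, not the Wrangler module's actual API:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("wrangler-demo").getOrCreate()

    # Enforce one uniform schema at ingestion time instead of inferring it per file.
    schema = StructType([
        StructField("event_time", TimestampType(), nullable=False),
        StructField("metric",     StringType(),    nullable=False),
        StructField("value",      DoubleType(),    nullable=True),
    ])

    raw = spark.read.csv("s3a://demo-bucket/incoming/*.csv",
                         schema=schema, header=True)

    # A simple validation step: reject rows that violate basic invariants.
    valid = raw.filter(raw.event_time.isNotNull() & raw.metric.isNotNull())

    # Land the cleaned data as Parquet, partitioned for time-series access patterns.
    (valid.write.mode("append")
          .partitionBy("metric")
          .parquet("s3a://demo-bucket/warehouse/metrics/"))

A real pipeline would plug richer validation and transformation logic into this step; the essential contract is CSV (or JSON, XML) in, schema-conformant Parquet out.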

Projections of data may have a sort order associated with them that specifies the horizontal segmentation, or partitioning, logic of the data. Details of these row groups are stored as metadata by the Data Manager. The Data Manager then automatically selects and returns the best set of overlapping projections for a particular table based on the columns required by the fired query. The best set of projections is governed by the cost metrics defined for each projection. Though it may seem as if redundantly storing data in multiple projections is a waste of disk space, the encoding schemes in Parquet ensure that the resultant projections are only a fraction of the raw data. This reduces the amount of data to be read off the disk by the Query Engine, thus increasing query performance.

Recognizing that businesses often have ETL pipelines already implemented, this component has been made dynamic so that it can serve either as a complete ETL engine or as a component that simply converts already-transformed data to the format required by the database.

DATA MANAGER

The Data Manager acts as the central nervous system of the product, responsible for coordinating data ingestion, data preparation, and querying tasks. It consists of the following two modules:

1. Query Planner

Query planning is broken down into three phases:
i. Logical Plan Analysis
ii. Physical Planning
iii. Projection Planning

In the projection planning phase, the Query Planner may generate multiple plans and compare them based on execution cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; the Query Planner contains the tree-node library for each of the logical, physical, and projection trees, along with the data types required by each of them.

PHASES OF QUERY PLANNING IN DATA MANAGER

Logical Plan
Each query contains relations which are computed in the form of an abstract syntax tree. The raw query may contain several unresolved attribute references or relations. For example, in the SQL query "select metric from Table", metric may not be a valid name and may be represented as mtr in the logical data. These logical operators are resolved using the Data Manager Configuration Catalog, which contains information about all the data sources along with their relation to the physical data. Planning starts with the unresolved logical query and applies rules that do the following:
1. Look up attributes in the logical source from the Configuration Catalog
2. Pipeline multiple operations into a single operation where possible
3. Optimize by pushing aggregates below filters wherever feasible, so as to read the minimum amount of data

Physical Plan
In the physical planning phase, the Data Manager takes a logical plan and generates a corresponding physical plan. This is done by mapping each logical operator/operand to the corresponding physical operator/operand using the relations recorded in the Data Manager Configuration Catalog. This may result in the injection of implicit JOIN operations into the plan where operands from two or more physical tables are used in the logical query. E.g., in the query

    select Correlation(Temperature, Sales) from table
    where Store = 'Florida_Store_1'
      and TransactionDate > '2015-01-01'
      and TransactionDate < '2015-03-01';

the Temperature column may actually reside in a physical source other than the transaction table. This may require a JOIN of the weather data with the transaction data, as sketched below.
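Here is a PySpark sketch of the implicit join the planner would inject for the example query above. The sample data and the weather-table column names are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("implicit-join-demo").getOrCreate()

    # Hypothetical stand-ins for the two physical sources in the example.
    transactions = spark.createDataFrame(
        [("Florida_Store_1", "2015-01-15", 120.0),
         ("Florida_Store_1", "2015-02-01", 95.0)],
        ["Store", "TransactionDate", "Sales"])
    weather = spark.createDataFrame(
        [("Florida_Store_1", "2015-01-15", 21.5),
         ("Florida_Store_1", "2015-02-01", 18.0)],
        ["Store", "Date", "Temperature"])

    # The logical query references Temperature and Sales together, so the planner
    # must join the weather source to the transaction source before aggregating.
    joined = (transactions
              .where((F.col("Store") == "Florida_Store_1") &
                     (F.col("TransactionDate") > "2015-01-01") &
                     (F.col("TransactionDate") < "2015-03-01"))
              .join(weather, (transactions.Store == weather.Store) &
                             (transactions.TransactionDate == weather.Date)))

    joined.select(F.corr("Temperature", "Sales").alias("correlation")).show()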

Projection Plan
The projection plan is the plan understood by the Query Engine for execution. It modifies and optimizes the physical plan based on the different projections available for the different operations. One or more projection plans may be generated from a single physical plan if multiple projections exist for it. Cost-based optimization algorithms are applied to select the optimal plan from the competing projection plans, where the cost metrics capture the cost of performing a particular operation on a projection. This may result in changes to the aggregate operations and to the join algorithms used in the physical plan.

2. Ingestion Manager

This module is responsible for triggering and managing the execution of all ingestion-related modules. It also stores metadata, which is later used by the Query Planner to decide the best set of projections for a query. The metadata file includes the cardinality information for each dimension as well as inverted indices. During processing, the Query Engine looks up the metadata file and returns a list of segment IDs. The number of segment IDs returned is passed to the job server, which helps determine the number of CPU cores to be used. This leads to significantly lower CPU utilization.

QUERY ENGINE

The Query Engine is an execution module built on top of Apache Spark, a fast and general-purpose cluster computing system. Spark stores data in memory and uses a powerful data abstraction, resilient distributed datasets (RDDs), which is a clever way of guaranteeing fault tolerance while minimizing network I/O. It has the ability to cache datasets in memory for interactive data analysis: extract a working set, cache it, and query it repeatedly. The Query Engine translates the projection plan generated by the Data Manager into an equivalent Spark DataFrame query, which is then executed on the Spark cluster. Apart from complete execution of the query, it supports sampling of results in order to help the Query Planner decide on a better set of projections. Once the query has executed on the cluster, the engine hands the result to the UI server, which transmits it to the dashboard.

IMPROVEMENTS TO OPEN-SOURCE SPARK

A few compatible improvements have been made to the open-source version of Spark for performance reasons. For example, in cases where the query involves a JOIN of a large table with a smaller table, the Query Engine ensures that the smaller table is broadcast across all Spark executors and retained across jobs. This is in contrast to stock Spark, where the table is not retained across different queries and must be broadcast every time. This has helped improve performance for costly operations like JOIN by orders of magnitude. We are in the process of contributing this back to the open-source community. (The baseline broadcast mechanism is sketched below.)

SUPPORTED FUNCTIONS

Basic: Average, Count, Min, Max, Sum, Product, Percentage, First, Last, etc.
Advanced: Correlation, Covariance, Trigonometric Functions, Power, Calculated Metrics (e.g. Sales/Quantity as AUR), Count Distinct (exact as well as approximate), etc.

UPCOMING FEATURES

1. Data-type-specific encoding and compression
2. In-memory columnar storage
3. Vectorization using SIMD instructions
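For reference, the baseline broadcast-join mechanism in stock Spark looks like the following PySpark sketch (paths and key columns are illustrative assumptions). The proprietary improvement of retaining the broadcast copy across queries is not shown, as it is not part of open-source Spark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

    facts = spark.read.parquet("s3a://demo-bucket/warehouse/events_parquet/")  # large table
    stores = spark.read.parquet("s3a://demo-bucket/warehouse/stores/")         # small table

    # The broadcast hint ships the small table to every executor once per query,
    # avoiding a shuffle of the large table. Stock Spark re-broadcasts it for
    # each new query; the improvement described above keeps it resident.
    result = facts.join(broadcast(stores), "store_id")
    result.groupBy("region").count().show()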

QUERY PROCESSING WORKFLOW

[Figure: query processing workflow diagram]

PERFORMANCE BENCHMARKS (CONDUCTED BY ONE OF OUR CUSTOMERS)

Query description                              Impala   Sigmoid (Spark)   Actian Vortex   Vertica
One metric, hourly, one week                   18s      0.8s              1.2s            1.4s
All metrics, hourly, one week                  28s      0.8s              6.3s            8.7s
All metrics, hourly, one week, one filter      21s      0.9s              3.2s            3.6s
All metrics, hourly, two filters               38s      2s                2.8s            0.9s
Group by, one week, no filter, all metrics     35s      1.6s              7.7s            0.9s
Group by, one week, no filter, all metrics     28s      1.6s              7.8s            -
Group by, one week, no filter, one metric      17s      1s                2.4s            1.2s
Group by, all metrics, one week, one filter    26s      3.2s              3.4s            1.3s
Group by, all metrics, one week, two filters   38s      3.2s              12.7s           1.6s

[Screenshots: 1) Login view  2) Comparison view]

[Screenshots: 3) Metric addition  4) Dimension addition]

[Screenshots: Construct dashboards; Create free-flow charts (drag-drop dimensions / measures)]

Sigmoid | contact@sigmoid.com | 1343 Kingfisher Way, Sunnyvale, CA 94087 | +1 760 203 3257