Technical Sheet NITRODB Time-Series Database

10X Performance, 1/10th the Cost

INTRODUCTION

NITRODB is an Apache Spark-based time-series database built to store and analyze hundreds of terabytes of time-series data at lightning-fast speeds with an extremely low hardware requirement. It has been designed with a single goal in mind: deliver 10X query performance at 1/10th the cost. This sheet describes the architecture and technology that power this vision. Some of the key features that enable this are:

1. Intelligent Data Storage
Data is stored on disk as Parquet, a columnar data format for the Hadoop ecosystem. This format avoids reading columns that are not needed by the query, in contrast to conventional row-oriented databases, which read all columns of a row and therefore incur high disk I/O. Moreover, Parquet provides a number of compression and encoding techniques that not only reduce the overall storage footprint but also dramatically improve query performance by reducing CPU, memory, and disk I/O requirements at processing time. Original data size is generally reduced by 90% with zero data loss, even with high-availability redundancy turned on.

2. Intelligent Data Processing
The system generates an intelligent mix of specific materialized views (for frequently accessed data) and primary indexes (BTree and Bloom filter based) on the ingested raw data and stores them as segments distributed across the cluster. Only the index segments and views that are required by a query are targeted, which reduces the volume of data that needs to be read and processed. Since fewer CPU cores are required to process each query, the unused computing resources are made available to other queries so that they can run in parallel.

FEATURES

The system also includes several other features designed to offer performance, scalability, reliability, and ease of use.
These include:

1. Local caching
Repeatedly issuing I/O requests for the same few popular columns, indexes and metadata files is time consuming. To prevent this, frequently accessed column and index blocks as well as small metadata files are automatically cached. This reduces the number of round-trips to the HDFS cluster where the data resides (including NameNode lookups), resulting in superior performance.

2. Adaptive query and index cache
When working with a BI tool, it is very common to have many repeating (or similar) queries generated by dashboards and pre-built reports. Also, many ad-hoc explorations, while eventually diving deep into a unique subset of the data, typically start by progressively adding filters on top of the logical data model. Our engine takes advantage of this predictable BI user behavior pattern by automatically caching result sets, both intermediate and near-final ones, and transparently using them when applicable. The adaptive cache is also incremental. For example, assume that query results were cached and later new rows (containing information used by the query) are added. If the query is fired again, in most cases the engine will still leverage the cached results. It will compute the query results only on the delta and merge them with the cached results to produce the final query output. It will then replace the old cached results with the new ones. This mechanism is transparent to the user, who will never see stale or inconsistent data.

3. Scale-Out Architecture
As you add more concurrent users or as your query load starts to increase, you can easily start additional nodes to support them, and remove them when they are no longer needed.

4. Fully Fault Tolerant
The system uses Mesos to ensure fault tolerance. A single "hot" master runs in a cluster with some number of standby nodes that are monitored by Zookeeper. Zookeeper monitors all the nodes in the master cluster and manages the election of a new master when the hot master node fails. Data replication is handled by HDFS.

5. Easy and Flexible Deployment on cluster
We support one-click deployment on AWS, GCE and Azure via our custom cloud management tool, akin to a custom "recipe" on Chef.

SIGVIEW

Key Features & Benefits
SIGVIEW is Sigmoid's fully managed, full-stack Business Intelligence solution that combines a revolutionary in-memory columnar database, an analytics engine and an interactive visual frontend. It delivers a full range of capabilities: ingestion of over a million events per second, ad-hoc queries on petabytes of data with response times in seconds, interactive dashboards, notifications and alerts, and enterprise reporting. SIGVIEW is based on Apache Spark and integrates easily with an organization's existing IT infrastructure at the lowest TCO and highest ROI.

Ad-hoc Exploratory Analytics at Scale
SIGVIEW has been designed for ad-hoc exploratory analytics at scale, so that it is able to respond to your analytics queries in less than a few seconds. Reporting which used to take hours is now ready in minutes.

Columnar Storage
We use innovative data encoding and storage technology to compress data severalfold. Vertically partitioning the data means we read only the data that is required, reducing infrastructure overhead and improving performance.

Fully Scale-Out Architecture
As your data volume increases, we simply add more machines on the fly to handle the increased volume. Moreover, you are not stuck with one monolithic machine or appliance. Our platform can scale horizontally to 1000s of nodes.

Enriched Insights for Everyone
SIGVIEW provides complete and enriched insights to all employees across all levels in an organization, not just analysts. Information is optimized for a user's role by leveraging a unified BI foundation that is capable of ingesting data from different data sources. It can source both structured and unstructured data and store it in a single place without cubing or pre-aggregation, so that business users can access information in the most intuitive, logical manner through interactive dashboards, ad-hoc queries, reports and API function calls.

Indexing & In-Memory Projections
We use statistical indexes like Bloom filters and in-memory projections of the data to significantly speed up data access and reduce query processing time by 10x over other warehousing solutions.

Optimized for Time Series Data
The SIGVIEW platform is optimized to store and analyze any time-series data. Hence it is not restricted to a particular industry but can be used across retail, banking, logistics, advertising technology, etc.

Real time data to real time decisions
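The incremental behavior of the adaptive query and index cache described under FEATURES can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: it assumes an append-only table held as a Python list, a single cached query (SUM of a value grouped by key), and a hypothetical class name `IncrementalSumCache`.

```python
from collections import defaultdict


class IncrementalSumCache:
    """Caches SUM(value) GROUP BY key and refreshes it from new rows only."""

    def __init__(self):
        self.totals = defaultdict(float)  # cached result set
        self.rows_seen = 0                # rows already folded into the cache
        self.rows_scanned = 0             # bookkeeping: total rows ever scanned

    def query(self, table):
        """Return per-key sums for an append-only list of (key, value) rows."""
        delta = table[self.rows_seen:]    # rows added since the last run
        for key, value in delta:          # compute on the delta and merge
            self.totals[key] += value
        self.rows_seen = len(table)       # the merged result replaces the old one
        self.rows_scanned += len(delta)
        return dict(self.totals)


table = [("US", 10.0), ("IN", 5.0), ("US", 2.5)]
cache = IncrementalSumCache()
first = cache.query(table)                 # scans all 3 rows
table.extend([("IN", 1.0), ("UK", 4.0)])   # new rows arrive
second = cache.query(table)                # scans only the 2 new rows
```

The second call returns up-to-date totals while touching only the newly appended rows, which is the property the text attributes to the incremental cache. Aggregates that cannot be merged this simply (for example exact distinct counts) would need extra state.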

ARCHITECTURE

The system broadly consists of four major components, which are discussed in detail below:

1. Data Ingestion
2. Data Preparation
3. Data Manager
4. Query Engine

DATA INGESTION

Connectors to standard data sources such as Amazon S3, Apache Kafka, HDFS, transactional databases, CRM systems and REST APIs can be used to ingest data into the system at user-defined periodic intervals (configurable down to 1 minute). Standard file formats such as CSV, JSON, XML and Parquet are natively supported for ingestion.

DATA PREPARATION

Data ingested into the system is converted to Parquet by the Spark-based Wrangler Module. This module is also responsible for running data validations and for keeping the schema of the data uniform throughout the system. Its pluggable architecture also allows complex user-specific workflow logic (entire ETL pipelines, for example) to be implemented.

DATA PREPARATION PHASES

Internally, an indexer module stores data as projections, each of which holds one or more attributes of a logical table. This is completely transparent to the end user, who sees the underlying data only as a single logical entity. This allows the system to access only the data actually required by the query. It also ensures that data is distributed across several projections, which further improves performance. If certain columns are always read together, e.g. PUBLISHER, COUNTRY, REVENUE, they can be grouped together so that they are retrieved in a single I/O.

Superfast decisions based on superfast insights
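To make the idea of projections concrete, here is a small sketch of how a query's required columns might be mapped to the projections that store them. The projection names, the `TIMESTAMP`, `DEVICE` and `OS` columns, and the greedy selection rule are assumptions made for illustration; the sketch ignores cost metrics and is not the indexer's actual logic.

```python
def projections_for_query(projections, required_columns):
    """Pick projections (column groups) until every required column is covered.

    Greedy rule: repeatedly take the projection that covers the most
    still-missing columns.
    """
    missing = set(required_columns)
    chosen = []
    while missing:
        name, cols = max(projections.items(),
                         key=lambda item: len(missing & item[1]))
        if not missing & cols:
            raise KeyError(f"no projection stores columns: {sorted(missing)}")
        chosen.append(name)
        missing -= cols
    return chosen


# Hypothetical layout: columns that are read together share a projection.
projections = {
    "p_revenue": {"PUBLISHER", "COUNTRY", "REVENUE"},
    "p_time": {"TIMESTAMP", "PUBLISHER"},
    "p_device": {"DEVICE", "OS"},
}

reads = projections_for_query(projections, ["PUBLISHER", "COUNTRY", "REVENUE"])
two_reads = projections_for_query(projections, ["TIMESTAMP", "REVENUE"])
```

With this layout, a query over PUBLISHER, COUNTRY and REVENUE touches a single projection, and `p_device` is never read by either query.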

Projections may have a sort order associated with them that specifies the horizontal segmentation (partitioning) logic of the data. Details of these row groups are stored as metadata by the Data Manager. The Data Manager then automatically selects and returns the best set of overlapping projections for a table, based on the columns required by the query. The best set of projections is governed by the cost metrics defined for each projection.

Although redundantly storing data in multiple projections may seem to waste disk space, the encoding schemes in Parquet ensure that the resulting projections are only a fraction of the size of the raw data. Projections reduce the amount of data the Query Engine has to read off disk, thus increasing query performance.

Recognizing that businesses often have ETL pipelines already implemented, this component has been made flexible so that it can serve either as a complete ETL engine or simply as a converter of already transformed data into the format required by the database.

DATA MANAGER

The Data Manager acts as the central nervous system of the product, responsible for coordinating data ingestion, data preparation and querying tasks. It consists of the following two modules:

1. Query Planner

Query planning is broken down into three phases:

i. Logical Plan Analysis
ii. Physical Planning
iii. Projection Planning

In the projection planning phase, the Query Planner may generate multiple plans and compare them based on execution cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; the Query Planner contains the library of tree nodes for each of the Logical, Physical and Projection trees, along with the data types each of them requires.

PHASES OF QUERY PLANNING IN DATA MANAGER

Logical Plan

Each query contains relations, which are parsed into an Abstract Syntax Tree.
The raw query may contain several unresolved attribute references or relations. For example, in the SQL query "select metric from Table", metric may not be a valid name and may be represented as mtr in the underlying data. These logical operators are resolved using the Data Manager Configuration Catalog, which contains information about all the data sources along with their relation to the physical data. Planning starts with the unresolved logical query and applies rules that do the following:

1. Look up the attributes of the logical source in the Configuration Catalog.
2. Pipeline multiple operations into a single operation where possible.
3. Optimize by applying filters before aggregates wherever feasible, so that as little data as possible is read.

Physical Plan

In the physical planning phase, the Data Manager takes a logical plan and generates a corresponding physical plan. This is done by mapping each logical operator or operand to the corresponding physical operator or operand, using the relations recorded in the Data Manager Configuration Catalog. This may result in implicit JOIN operations being injected into the plan when operands from two or more physical tables are used in the logical query. For example, in the query

select Correlation(Temperature, Sales) from table where Store = 'Florida_Store_1' and TransactionDate > ... and TransactionDate < ...;

the Temperature column may actually be stored in a different physical source from the transaction table. The plan may then need to JOIN the weather data with the transaction data.

In depth insights into your data in Real Time
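The resolution of logical attributes against the Configuration Catalog, and the implicit JOIN injection described under Physical Plan, can be sketched as follows. The catalog contents, the physical table and column names, and the shape of the returned plan are all hypothetical; the real catalog format is not described in this sheet.

```python
# Hypothetical catalog: logical attribute -> (physical table, physical column)
CATALOG = {
    "Temperature": ("weather", "temp_c"),
    "Sales": ("transactions", "sales_amt"),
    "Store": ("transactions", "store_id"),
    "TransactionDate": ("transactions", "txn_date"),
}


def to_physical_plan(logical_attributes, catalog=CATALOG):
    """Resolve logical attributes and inject JOINs when they span tables."""
    columns = {}
    tables = []
    for attr in logical_attributes:
        if attr not in catalog:
            raise KeyError(f"unresolved attribute: {attr}")
        table, column = catalog[attr]
        columns[attr] = f"{table}.{column}"
        if table not in tables:
            tables.append(table)
    # Every table after the first has to be joined to the first one.
    joins = [(tables[0], other) for other in tables[1:]]
    return {"columns": columns, "scan": tables, "joins": joins}


plan = to_physical_plan(["Temperature", "Sales", "Store", "TransactionDate"])
single = to_physical_plan(["Sales", "Store"])
```

For the correlation query above, the attributes resolve to two physical tables, so the sketch emits one join between `weather` and `transactions`; a query that touches only transaction attributes gets no join.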

Projection Plan

The Projection Plan is the plan understood by the Query Engine for execution. It modifies and optimizes the physical plan based on the projections available for each operation. One or more projection plans may be generated from a single physical plan if multiple projections are available for it. Cost-based optimization algorithms select the best plan from the competing projection plans. The cost metrics represent the cost of performing a particular operation on a projection. Projection planning may change the aggregate operations and the join algorithms used in the physical plan.

2. Ingestion Manager

This module is responsible for triggering and managing the execution of all ingestion-related modules. It also stores the ingestion metadata, which is later used by the Query Planner to decide the best set of projections for a query. The metadata file includes the cardinality information for each dimension and the inverted indices. During processing, the query engine looks up the metadata file and returns a list of segment ids. The number of segment ids returned is passed to the job server, which uses it to determine the number of CPU cores to be used. This leads to significantly lower CPU utilization.

QUERY ENGINE

The Query Engine is an execution module built on top of Apache Spark, a fast and general-purpose cluster computing system. Spark can keep data in memory and uses a powerful data abstraction, the resilient distributed dataset (RDD), which provides fault tolerance and minimizes network I/O. It can cache datasets in memory for interactive data analysis: extract a working set, cache it and query it repeatedly. The Query Engine translates the Projection Plan generated by the Data Manager into an equivalent Spark DataFrame query, which is then executed on the Spark cluster.
Apart from complete execution of a query, it supports sampling of results in order to help the Query Planner decide on a better set of projections. Once the query has been executed on the cluster, it hands the result over to the UI server, which transmits it to the dashboard.

IMPROVEMENTS TO OPEN-SOURCE SPARK

A few compatible improvements have been made to the open-source version of Spark for performance reasons. For example, when a query involves a JOIN of a large table with a smaller table, the Query Engine ensures that the smaller table is broadcast to all Spark executors and retained across jobs. This is in contrast to stock Spark, where the table is not retained across queries and has to be broadcast every time. This has improved the performance of costly operations like JOIN by orders of magnitude. We are in the process of contributing this back to the open-source community.

SUPPORTED FUNCTIONS

Basic - Average, Count, Min, Max, Sum, Product, Percentage, First, Last, etc.
Advanced - Correlation, Covariance, Trigonometric Functions, Power, Calculated Metrics (e.g. Sales/Quantity as AUR), Count Distinct (exact as well as approximate), etc.

UPCOMING FEATURES

1. Data-type-specific encoding and compression
2. In-memory columnar storage
3. Vectorization using SIMD instructions
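As a rough illustration of the broadcast retention described under IMPROVEMENTS TO OPEN-SOURCE SPARK, the sketch below simulates the idea in a single Python process. It does not use or reproduce any Spark API: `BroadcastRegistry` and `broadcast_join` are hypothetical names, the "broadcast" is just a hash table kept in memory, and the small table is assumed to have unique join keys.

```python
class BroadcastRegistry:
    """Keeps the hash table built from a small table for reuse by later joins."""

    def __init__(self):
        self._tables = {}
        self.builds = 0  # how many times a small table was actually built

    def get(self, name, rows, key):
        if name not in self._tables:
            # Built once; later queries reuse it instead of rebuilding.
            self._tables[name] = {row[key]: row for row in rows}
            self.builds += 1
        return self._tables[name]


def broadcast_join(large_rows, small_name, small_rows, key, registry):
    """Inner hash join of large_rows against the retained small table."""
    lookup = registry.get(small_name, small_rows, key)
    return [{**row, **lookup[row[key]]}
            for row in large_rows if row[key] in lookup]


stores = [{"store_id": 1, "city": "Miami"}, {"store_id": 2, "city": "Tampa"}]
sales = [
    {"store_id": 1, "amount": 30},
    {"store_id": 2, "amount": 12},
    {"store_id": 3, "amount": 7},
]

registry = BroadcastRegistry()
q1 = broadcast_join(sales, "stores", stores, "store_id", registry)
q2 = broadcast_join(sales[:1], "stores", stores, "store_id", registry)
```

Both queries join against `stores`, but the lookup table is built only once (`registry.builds` stays at 1). In a real cluster the saving would come from not re-shipping the small table to every executor for each query, and a retained copy would also need invalidation when the small table changes.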

QUERY PROCESSING WORKFLOW

PERFORMANCE BENCHMARKS (CONDUCTED BY ONE OF OUR CUSTOMERS)

Query description                               Impala   Sigmoid (Spark)   Actian Vortex   Vertica
One metric, hourly, one week                    18s      0.8s              1.2s            1.4s
All metrics, hourly, one week                   28s      0.8s              6.3s            8.7s
All metrics, hourly, one week, one filter       21s      0.9s              3.2s            3.6s
All metrics, hourly, two filters                38s      2s                2.8s            0.9s
Group by, one week, no filter, all metrics      35s      1.6s              7.7s            0.9s
Group by, one week, no filter, all metrics      28s      1.6s              7.8s            -
Group by, one week, no filter, one metric       17s      1s                2.4s            1.2s
Group by, all metrics, one week, one filter     26s      3.2s              3.4s            1.3s
Group by, all metrics, one week, two filters    38s      3.2s              12.7s           1.6s

1) LOGIN VIEW
2) COMPARISON VIEW

3) METRIC ADDITION
4) DIMENSION ADDITION

CONSTRUCT DASHBOARDS
CREATE FREE-FLOW CHARTS (DRAG-DROP DIMENSIONS / MEASURES)

Sigmoid
1343 Kingfisher Way, Sunnyvale, CA
