Technical Sheet NITRODB Time-Series Database

10X Performance, 1/10th the Cost

INTRODUCTION

NITRODB is an Apache Spark-based time-series database built to store and analyze hundreds of terabytes of time-series data at lightning-fast speeds with an extremely low hardware requirement. It has been designed with a single goal in mind: deliver 10X query performance at 1/10th the cost. The architecture and technology that power this vision are described below. Some of the key features that enable this are:

1. Intelligent Data Storage

Data is stored on disk as Parquet, a columnar storage format from the Hadoop ecosystem. This format avoids reading columns that the query does not need, in contrast to conventional row-oriented databases, which read all columns of a row and thus incur high disk I/O. Moreover, Parquet provides a number of compression and encoding techniques, which not only reduce the overall storage footprint but also dramatically improve query performance by reducing CPU, memory, and disk I/O requirements at processing time. Original data size is generally reduced by 90% with zero data loss, even with high-availability redundancy turned on.

2. Intelligent Data Processing

The system generates an intelligent mix of specific materialized views (for frequently accessed data) and primary indexes (BTree and Bloom filter based) on the ingested raw data and stores them as segments distributed across the cluster. Only the index segments and views required by a query are targeted, which reduces the volume of data that needs to be read and processed. Since fewer CPU cores are required to process each query, the unused computing resources are made available to other queries so that they can run in parallel.

FEATURES

The system also includes several other features designed to offer performance, scalability, reliability, and ease of use. These include:

1. Local caching

Repeatedly issuing I/O requests for the same few popular columns, indexes and metadata files is time consuming. To prevent this, frequently accessed column and index blocks, as well as small metadata files, are automatically cached. This reduces the number of round-trips to the HDFS NameNode and to the DataNodes where the data resides, resulting in superior performance.

2. Adaptive query and index cache

When working with a BI tool, it is very common to have many repeating (or similar) queries generated by dashboards and pre-built reports. Also, many ad-hoc explorations, while eventually diving deep into a unique subset of the data, typically start by progressively adding filters on top of the logical data model. Our engine takes advantage of this predictable BI user behavior by automatically caching result sets, both intermediate and near-final ones, and transparently using them when applicable.

The adaptive cache is also incremental. For example, assume that query results were cached and later new rows (containing information used by the query) are added. If the query is fired again, in most cases the engine will still leverage the cached results. It will compute the query results only on the delta and merge them with the cached results to produce the final query output. It will then replace the old cached results with the new ones. This mechanism is transparent to the user, who will never see stale or inconsistent data.

3. Scale-Out Architecture

As you add more concurrent users or as your query load starts to increase, you can easily start additional nodes to support them, and remove them when they are no longer needed.

4. Fully Fault Tolerant

The system uses Mesos for ensuring fault tolerance. A single "hot" master runs in a cluster with some number of standby nodes that are monitored by ZooKeeper. ZooKeeper monitors all the nodes in the master cluster and manages the election of a new master when the hot master node fails. Data replication is handled by HDFS.

5. Easy and Flexible Deployment on cluster

We support one-click deployment on AWS, GCE and Azure via our custom cloud management tool, akin to a custom "recipe" on Chef.

Real-Time Intelligence

SIGVIEW is Sigmoid's fully managed, full-stack Business Intelligence solution that combines a revolutionary in-memory columnar database, an analytics engine and an interactive visual frontend. It delivers a full range of capabilities: ingestion of over a million events per second, ad-hoc queries on petabytes of data with response times in seconds, interactive dashboards, notifications and alerts, and enterprise reporting. SIGVIEW is based on Apache Spark and integrates easily with an organization's existing IT infrastructure, improving it at the lowest TCO and highest ROI.

Enriched Insights for Everyone

SIGVIEW provides complete and enriched insights to all employees across all levels in an organization, not just analysts. Information is optimized for a user's role by leveraging a unified BI foundation that is capable of ingesting data from different data sources. It can source both structured and unstructured data and store it in a single place without cubing or pre-aggregation, so that business users can access information in the most intuitive, logical manner through interactive dashboards, ad-hoc queries, reports and API function calls.

Key Features & Benefits

Ad-hoc Exploratory Analytics at Scale

SIGVIEW has been designed for ad-hoc exploratory analytics at scale, so that it is able to respond to your queries in less than a few seconds. Reporting which used to take hours is now ready in minutes.

Columnar Storage

We use innovative data encoding and storage technology to compress data several-fold. Vertically partitioning the data means we read only the data that is required, which reduces overhead and improves performance.

Fully Scale-Out Architecture

As your data volume increases, we simply add more machines on the fly to handle the increased volume. Moreover, you are not stuck with one monolithic machine or appliance. Our platform can scale horizontally to 1000s of nodes.

Indexing & In-Memory Projections

We use statistical indexes such as Bloom filters, together with in-memory projections of the data, to significantly speed up data access and reduce query processing time by 10x compared with other warehousing solutions.

Optimized for Time Series Data

The SIGVIEW platform is optimized to store and analyze any time-series data. Hence it is not restricted to a particular industry but can be used across retail, banking, logistics, advertising technology, etc.

Real time data to real time decisions
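The incremental behavior of the adaptive cache described under FEATURES (compute on the delta, merge with the cached result, replace the cache) can be illustrated with a small sketch. This is not NITRODB code: the class name `IncrementalSumCache`, the append-only list standing in for a table, and the single `SUM ... GROUP BY` query are all assumptions made for illustration.

```python
from collections import defaultdict


class IncrementalSumCache:
    """Illustrative cache for SUM(value) GROUP BY key over an append-only table."""

    def __init__(self):
        self._totals = defaultdict(float)  # cached aggregate per group key
        self._rows_seen = 0                # rows already folded into the cache

    def query(self, table):
        """Return (result, rows_scanned) for `table`, a list of (key, value) rows."""
        delta = table[self._rows_seen:]    # only the rows added since the last run
        for key, value in delta:
            self._totals[key] += value     # merge the delta into the cached result
        self._rows_seen = len(table)       # the merged result replaces the old cache
        return dict(self._totals), len(delta)


# Hypothetical usage: the second run scans one new row instead of the whole table.
table = [("US", 10.0), ("IN", 5.0)]
cache = IncrementalSumCache()
first, scanned_first = cache.query(table)
table.append(("US", 2.5))
second, scanned_second = cache.query(table)
```

The merge step is this simple only because SUM is decomposable; aggregates such as exact distinct counts would need richer cached state, which is presumably why the source says the cache is reused "in most cases" rather than always.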

ARCHITECTURE

The system broadly consists of four (4) major components, which are discussed in detail below:

1. Data Ingestion
2. Data Preparation
3. Data Manager
4. Query Engine

DATA INGESTION

Connectors to standard data sources such as Amazon S3, Apache Kafka, HDFS, transactional databases, CRM systems and REST APIs can be used to ingest data into the system at user-defined, periodic intervals (configurable down to 1 minute). All standard file formats for ingestion, such as CSV, JSON, XML and Parquet, are natively supported.

DATA PREPARATION

Data ingested into the system is converted to Parquet by the Spark-based Wrangler Module. This module is also responsible for running data validations and for keeping the schema of the data uniform throughout the system. Its pluggable architecture also allows complex user-specific workflow logic (entire ETL pipelines, for example) to be implemented.

[Figure: DATA PREPARATION PHASES]

Internally, an indexer module stores data in a form that ensures each projection of data represents one or more attributes of a logical table. This is completely transparent to the end user, who sees the underlying data only as a single logical entity. This allows the system to access only the data actually required by the query. It also ensures that data is distributed across several projections, which further improves performance. If certain columns are always read together, e.g. PUBLISHER, COUNTRY, REVENUE, they can be grouped together so that they are retrieved in a single I/O.

Superfast decisions based on superfast insights
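To make the column-grouping idea from the data preparation discussion concrete, the sketch below models projections as named groups of columns and works out which of them a query has to read. The projection names, the extra columns (DEVICE, BROWSER, TIMESTAMP) and the function `projections_to_read` are hypothetical; only the PUBLISHER, COUNTRY, REVENUE grouping comes from the text.

```python
# Hypothetical layout: each projection stores a group of columns together on disk.
PROJECTIONS = {
    "p_revenue": ["PUBLISHER", "COUNTRY", "REVENUE"],
    "p_device": ["DEVICE", "BROWSER"],
    "p_time": ["TIMESTAMP"],
}


def projections_to_read(required_columns, projections=PROJECTIONS):
    """Return the projections that hold at least one column the query needs."""
    required = set(required_columns)
    chosen = [name for name, cols in projections.items() if required & set(cols)]
    covered = {col for name in chosen for col in projections[name]}
    missing = required - covered
    if missing:
        raise KeyError(f"no projection stores: {sorted(missing)}")
    return chosen


# Columns that are always read together live in one projection, so one read suffices.
reads = projections_to_read(["PUBLISHER", "COUNTRY", "REVENUE"])
# A query spanning two groups touches two projections and skips p_device entirely.
mixed = projections_to_read(["REVENUE", "TIMESTAMP"])
```

In this toy layout every column lives in exactly one projection; the overlapping projections mentioned in the source would additionally require choosing among alternatives by cost.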

Projections of data may have a sort order associated with them that specifies the horizontal segmentation (partitioning) logic of the data. Details of these row groups are stored as metadata by the Data Manager. The Data Manager then automatically selects and returns the best set of overlapping projections for a particular table, based on the columns required by the query. The best set of projections is governed by the cost metrics defined for each projection. Although redundantly storing data in multiple projections may seem a waste of disk space, the encoding schemes in Parquet ensure that the resulting projections are only a fraction of the size of the raw data. This reduces the amount of data the Query Engine has to read off the disk, thus increasing query performance.

Recognizing that businesses often have ETL pipelines already implemented, this component has been made flexible so that it can serve either as a complete ETL engine or as a component that simply converts already transformed data to the format required by the database.

DATA MANAGER

The Data Manager acts as the central nervous system of the product, responsible for coordinating Data Ingestion, Data Preparation and querying tasks. It consists of the following 2 modules:

1. Query Planner

Query planning is broken down into 3 phases:

i. Logical Plan Analysis
ii. Physical Planning
iii. Projection Planning

In the Projection Planning phase, the Query Planner may generate multiple plans and compare them based on execution cost. All other phases are purely rule-based. Each phase uses different types of tree nodes; the Query Planner contains the library of tree nodes for each of the Logical, Physical and Projection trees, together with the data types required by each of them.

[Figure: PHASES OF QUERY PLANNING IN DATA MANAGER]

Logical Plan

Each query contains relations, which are represented in the form of an Abstract Syntax Tree. The raw query may contain several unresolved attribute references or relations. For example, in the SQL query: select metric from Table, metric may not be a valid name and may be represented as mtr in the logical data. These logical operators are resolved using the Data Manager Configuration Catalog, which contains information about all the data sources along with their relation to the physical data. Planning starts with the unresolved logical query and applies rules that do the following:

1. Look up attributes in the logical source from the Configuration Catalog.
2. Pipeline multiple operations into a single operation where possible.
3. Optimize by pushing filters below aggregates wherever feasible, so that as little data as possible is read.

Physical Plan

In the physical planning phase, the Data Manager takes a Logical Plan and generates a corresponding physical plan. This is done by mapping each Logical Operator/Operand to the corresponding Physical Operator/Operand using the relations recorded in the Data Manager Configuration Catalog. This may result in implicit JOIN operations being injected into the plan when operands from 2 or more physical tables are used in the logical query. For example, in the query: select Correlation(Temperature, Sales) from table where Store = 'Florida_Store_1' and TransactionDate > ... and TransactionDate < ...; the Temperature column may actually be present in a physical source other than the transaction table. This may require us to JOIN the weather data with the transaction data.

In depth insights into your data in Real Time
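The logical-to-physical mapping described above, including the implicit JOIN from the Temperature/Sales example, can be sketched as a catalog lookup. Everything concrete here is an assumption for illustration: the physical table names (`weather`, `transactions`), the physical column names, and the function `to_physical` are not taken from the product.

```python
# Hypothetical Configuration Catalog: logical attribute -> (physical table, column).
CATALOG = {
    "Temperature": ("weather", "temp_c"),
    "Sales": ("transactions", "sales_amt"),
    "Store": ("transactions", "store_id"),
    "TransactionDate": ("transactions", "txn_date"),
}


def to_physical(logical_attributes, catalog=CATALOG):
    """Resolve logical attributes and report which physical tables must be joined."""
    resolved = {}
    tables = []
    for attr in logical_attributes:
        if attr not in catalog:
            raise KeyError(f"unresolved attribute: {attr}")
        table, column = catalog[attr]
        resolved[attr] = f"{table}.{column}"
        if table not in tables:
            tables.append(table)
    # Operands from more than one physical table imply an injected JOIN.
    joins = [(tables[0], other) for other in tables[1:]]
    return {"columns": resolved, "tables": tables, "joins": joins}


# Attributes used by the correlation query in the text.
plan = to_physical(["Temperature", "Sales", "Store", "TransactionDate"])
```

A real planner would also need join keys and join types from the catalog; the sketch only shows where the need for a JOIN is detected.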

Projection Plan

The Projection Plan is the plan that the Query Engine understands and executes. It modifies and optimizes the physical plan based on the projections available for the different operations. One or more projection plans may be generated from a single Physical Plan if multiple projections are available for it. Cost-based optimization algorithms are applied to select the best plan from the competing projection plans. Cost metrics express the cost of performing a particular operation on a projection. This may change the aggregate operations and the join algorithms used in the Physical Plan.

2. Ingestion Manager

This module is responsible for triggering and managing the execution of all ingestion-related modules. It also stores the ingestion metadata, which is later used by the Query Planner to decide the best set of projections for a query. The metadata file includes the cardinality information for each dimension and inverted indices. During processing, the query engine looks up the metadata file and returns a list of segment ids. The number of segment ids returned is passed to the job server, which uses it to determine the number of CPU cores to be used. This leads to significantly lower CPU utilization.

QUERY ENGINE

The Query Engine is an execution module built on top of Apache Spark, a fast and general-purpose cluster computing system. Spark keeps data in memory and uses a powerful data abstraction, resilient distributed datasets (RDDs), which provide fault tolerance while minimizing network I/O. It can cache datasets in memory for interactive data analysis: extract a working set, cache it and query it repeatedly. The Query Engine translates the Projection Plan generated by the Data Manager into an equivalent Spark DataFrame query, which is then executed on the Spark cluster. Apart from complete execution of the query, it supports sampling of results in order to help the Query Planner decide on a better set of projections. Once the query has been executed on the cluster, the Query Engine hands the result over to the UI server, which transmits it to the dashboard.

IMPROVEMENTS TO OPEN-SOURCE SPARK

A few compatible improvements have been made to the open-source version of Spark for performance reasons. For example, when a query involves a JOIN of a large table with a smaller table, the Query Engine ensures that the smaller table is broadcast to all Spark executors and retained across jobs. This is in contrast to stock Spark, where the table is not retained across different queries and has to be broadcast again every time. This has helped improve performance for costly operations like JOIN by orders of magnitude. We are in the process of contributing this back to the open-source community.

SUPPORTED FUNCTIONS

Basic - Average, Count, Min, Max, Sum, Product, Percentage, First, Last, etc.
Advanced - Correlation, Covariance, Trigonometric Functions, Power, Calculated Metrics (e.g. Sales/Quantity as AUR), Count Distinct (Exact as well as Approximation), etc.

UPCOMING FEATURES

1. Data type specific encoding & compression
2. In-memory columnar storage
3. Vectorization using SIMD instructions
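The retained-broadcast improvement described under IMPROVEMENTS TO OPEN-SOURCE SPARK can be pictured with a plain-Python hash join that builds the small table's lookup map once and reuses it for later queries. This is a single-process sketch of the idea only, not Spark code and not the vendor's implementation; `BroadcastJoiner` and the sample tables are hypothetical.

```python
class BroadcastJoiner:
    """Illustrative hash join that keeps each small table's lookup map across queries."""

    def __init__(self):
        self._broadcast = {}  # table name -> {join key: row}; stands in for executor copies
        self.builds = 0       # how many times a small table had to be (re)broadcast

    def _lookup_table(self, name, rows, key):
        # Build the hash map only on first use. A real system would also have to
        # invalidate it when the small table changes; that is omitted here.
        if name not in self._broadcast:
            self._broadcast[name] = {row[key]: row for row in rows}
            self.builds += 1
        return self._broadcast[name]

    def join(self, large_rows, small_name, small_rows, key):
        """Inner-join `large_rows` with the (cached) small table on `key`."""
        lookup = self._lookup_table(small_name, small_rows, key)
        return [{**row, **lookup[row[key]]} for row in large_rows if row[key] in lookup]


# Hypothetical data: two queries share one broadcast of the small `stores` table.
stores = [{"store_id": 1, "city": "Miami"}, {"store_id": 2, "city": "Tampa"}]
sales = [
    {"store_id": 1, "amount": 40},
    {"store_id": 2, "amount": 15},
    {"store_id": 3, "amount": 7},
]
joiner = BroadcastJoiner()
first = joiner.join(sales, "stores", stores, "store_id")
second = joiner.join(sales[:1], "stores", stores, "store_id")
```

The `builds` counter stays at 1 after both queries, which is the saving the text attributes to retaining the broadcast table instead of shipping it again for every query.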

[Figure: QUERY PROCESSING WORKFLOW]

PERFORMANCE BENCHMARKS (CONDUCTED BY ONE OF OUR CUSTOMERS)

Query description                              Impala   Sigmoid (Spark)   Actian Vortex   Vertica
One metric, hourly, one week                   18s      0.8s              1.2s            1.4s
All metrics, hourly, one week                  28s      0.8s              6.3s            8.7s
All metrics hourly, one week, one filter       21s      0.9s              3.2s            3.6s
All metrics hourly, two filters                38s      2s                2.8s            0.9s
Group by, one week, no filter, all metrics     35s      1.6s              7.7s            0.9s
Group by, one week, no filter, all metrics     28s      1.6s              7.8s            -
Group by, one week, no filter, one metric      17s      1s                2.4s            1.2s
Group by, all metrics, one week, one filter    26s      3.2s              3.4s            1.3s
Group by, all metrics, one week, two filters   38s      3.2s              12.7s           1.6s

1) LOGIN VIEW
2) COMPARISON VIEW

3) METRIC ADDITION
4) DIMENSION ADDITION

CONSTRUCT DASHBOARDS
CREATE FREE-FLOW CHARTS (DRAG-DROP DIMENSIONS / MEASURES)

Sigmoid 1343 Kingfisher Way, Sunnyvale, CA,


More information

Flash Storage Complementing a Data Lake for Real-Time Insight

Flash Storage Complementing a Data Lake for Real-Time Insight Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum

More information

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud Microsoft Azure Databricks for data engineering Building production data pipelines with Apache Spark in the cloud Azure Databricks As companies continue to set their sights on making data-driven decisions

More information

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools

Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools SAP Technical Brief Data Warehousing SAP HANA Data Warehousing Combine Native SQL Flexibility with SAP HANA Platform Performance and Tools A data warehouse for the modern age Data warehouses have been

More information

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop HAWQ: A Massively Parallel Processing SQL Engine in Hadoop Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar

More information

Case Study: Tata Communications Delivering a Truly Interactive Business Intelligence Experience on a Large Multi-Tenant Hadoop Cluster

Case Study: Tata Communications Delivering a Truly Interactive Business Intelligence Experience on a Large Multi-Tenant Hadoop Cluster Case Study: Tata Communications Delivering a Truly Interactive Business Intelligence Experience on a Large Multi-Tenant Hadoop Cluster CASE STUDY: TATA COMMUNICATIONS 1 Ten years ago, Tata Communications,

More information

Tuning Enterprise Information Catalog Performance

Tuning Enterprise Information Catalog Performance Tuning Enterprise Information Catalog Performance Copyright Informatica LLC 2015, 2018. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States

More information

EsgynDB Enterprise 2.0 Platform Reference Architecture

EsgynDB Enterprise 2.0 Platform Reference Architecture EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed

More information

Cloudera Kudu Introduction

Cloudera Kudu Introduction Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)

More information

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data IBM Db2 Event Store Disclaimer The information contained in this presentation is provided for informational purposes only.

More information

Enterprise Data Catalog for Microsoft Azure Tutorial

Enterprise Data Catalog for Microsoft Azure Tutorial Enterprise Data Catalog for Microsoft Azure Tutorial VERSION 10.2 JANUARY 2018 Page 1 of 45 Contents Tutorial Objectives... 4 Enterprise Data Catalog Overview... 5 Overview... 5 Objectives... 5 Enterprise

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI) The Certificate in Software Development Life Cycle in BIGDATA, Business Intelligence and Tableau program

More information

Shine a Light on Dark Data with Vertica Flex Tables

Shine a Light on Dark Data with Vertica Flex Tables White Paper Analytics and Big Data Shine a Light on Dark Data with Vertica Flex Tables Hidden within the dark recesses of your enterprise lurks dark data, information that exists but is forgotten, unused,

More information

Overview of Data Services and Streaming Data Solution with Azure

Overview of Data Services and Streaming Data Solution with Azure Overview of Data Services and Streaming Data Solution with Azure Tara Mason Senior Consultant tmason@impactmakers.com Platform as a Service Offerings SQL Server On Premises vs. Azure SQL Server SQL Server

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Down the event-driven road: Experiences of integrating streaming into analytic data platforms

Down the event-driven road: Experiences of integrating streaming into analytic data platforms Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich, 8.10.2018 Integrate

More information

Configuring and Deploying Hadoop Cluster Deployment Templates

Configuring and Deploying Hadoop Cluster Deployment Templates Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page

More information

Postgres Plus and JBoss

Postgres Plus and JBoss Postgres Plus and JBoss A New Division of Labor for New Enterprise Applications An EnterpriseDB White Paper for DBAs, Application Developers, and Enterprise Architects October 2008 Postgres Plus and JBoss:

More information

REGULATORY REPORTING FOR FINANCIAL SERVICES

REGULATORY REPORTING FOR FINANCIAL SERVICES REGULATORY REPORTING FOR FINANCIAL SERVICES Gordon Hughes, Global Sales Director, Intel Corporation Sinan Baskan, Solutions Director, Financial Services, MarkLogic Corporation Many regulators and regulations

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

GPU Accelerated Data Processing Speed of Thought Analytics at Scale

GPU Accelerated Data Processing Speed of Thought Analytics at Scale GPU Accelerated Data Processing Speed of Thought Analytics at Scale The benefits of Brytlyt s GPU Accelerated Database Brytlyt is an ultra-high performance database that combines patent pending intellectual

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 5: Analyzing Relational Data (1/3) February 8, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Achieving Horizontal Scalability. Alain Houf Sales Engineer Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches

More information

Informatica Enterprise Information Catalog

Informatica Enterprise Information Catalog Data Sheet Informatica Enterprise Information Catalog Benefits Automatically catalog and classify all types of data across the enterprise using an AI-powered catalog Identify domains and entities with

More information

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti

Table 1 The Elastic Stack use cases Use case Industry or vertical market Operational log analytics: Gain real-time operational insight, reduce Mean Ti Solution Overview Cisco UCS Integrated Infrastructure for Big Data with the Elastic Stack Cisco and Elastic deliver a powerful, scalable, and programmable IT operations and security analytics platform

More information

In-Memory Data Management Jens Krueger

In-Memory Data Management Jens Krueger In-Memory Data Management Jens Krueger Enterprise Platform and Integration Concepts Hasso Plattner Intitute OLTP vs. OLAP 2 Online Transaction Processing (OLTP) Organized in rows Online Analytical Processing

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Cloud Analytics and Business Intelligence on AWS

Cloud Analytics and Business Intelligence on AWS Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse

More information

Modernizing Business Intelligence and Analytics

Modernizing Business Intelligence and Analytics Modernizing Business Intelligence and Analytics Justin Erickson Senior Director, Product Management 1 Agenda What benefits can I achieve from modernizing my analytic DB? When and how do I migrate from

More information

Actian SQL Analytics in Hadoop

Actian SQL Analytics in Hadoop Actian SQL Analytics in Hadoop The Fastest, Most Industrialized SQL in Hadoop A Technical Overview 2015 Actian Corporation. All Rights Reserved. Actian product names are trademarks of Actian Corp. Other

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage

Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage White Paper Cisco Tetration Analytics Platform: A Dive into Blazing Fast Deep Storage What You Will Learn A Cisco Tetration Analytics appliance bundles computing, networking, and storage resources in one

More information

AUTOMATIC CLUSTERING PRASANNA RAJAPERUMAL I MARCH Snowflake Computing Inc. All Rights Reserved

AUTOMATIC CLUSTERING PRASANNA RAJAPERUMAL I MARCH Snowflake Computing Inc. All Rights Reserved AUTOMATIC CLUSTERING PRASANNA RAJAPERUMAL I MARCH 2019 SNOWFLAKE Our vision Allow our customers to access all their data in one place so they can make actionable decisions anytime, anywhere, with any number

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory

More information

HANA Performance. Efficient Speed and Scale-out for Real-time BI

HANA Performance. Efficient Speed and Scale-out for Real-time BI HANA Performance Efficient Speed and Scale-out for Real-time BI 1 HANA Performance: Efficient Speed and Scale-out for Real-time BI Introduction SAP HANA enables organizations to optimize their business

More information

Talend Big Data Sandbox. Big Data Insights Cookbook

Talend Big Data Sandbox. Big Data Insights Cookbook Overview Pre-requisites Setup & Configuration Hadoop Distribution Download Demo (Scenario) Overview Pre-requisites Setup & Configuration Hadoop Distribution Demo (Scenario) About this cookbook What is

More information

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData

Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData Interactive SQL-on-Hadoop from Impala to Hive/Tez to Spark SQL to JethroData ` Ronen Ovadya, Ofir Manor, JethroData About JethroData Founded 2012 Raised funding from Pitango in 2013 Engineering in Israel,

More information

OLAP Introduction and Overview

OLAP Introduction and Overview 1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite

More information

Intelligent Caching in Data Virtualization Recommended Use of Caching Controls in the Denodo Platform

Intelligent Caching in Data Virtualization Recommended Use of Caching Controls in the Denodo Platform Data Virtualization Intelligent Caching in Data Virtualization Recommended Use of Caching Controls in the Denodo Platform Introduction Caching is one of the most important capabilities of a Data Virtualization

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI /

Index. Raul Estrada and Isaac Ruiz 2016 R. Estrada and I. Ruiz, Big Data SMACK, DOI / Index A ACID, 251 Actor model Akka installation, 44 Akka logos, 41 OOP vs. actors, 42 43 thread-based concurrency, 42 Agents server, 140, 251 Aggregation techniques materialized views, 216 probabilistic

More information

Hortonworks DataFlow Sam Lachterman Solutions Engineer

Hortonworks DataFlow Sam Lachterman Solutions Engineer Hortonworks DataFlow Sam Lachterman Solutions Engineer 1 Hortonworks Inc. 2011 2017. All Rights Reserved Disclaimer This document may contain product features and technology directions that are under development,

More information

Microsoft SQL Server Training Course Catalogue. Learning Solutions

Microsoft SQL Server Training Course Catalogue. Learning Solutions Training Course Catalogue Learning Solutions Querying SQL Server 2000 with Transact-SQL Course No: MS2071 Two days Instructor-led-Classroom 2000 The goal of this course is to provide students with the

More information

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk Raanan Dagan and Rohit Pujari September 25, 2017 Washington, DC Forward-Looking Statements During the course of this presentation, we may

More information