The Definitive Guide to MongoDB Analytics

Analytics on MongoDB is a different beast than what you're familiar with. Don't expect to fire up your existing analytics tool, point it at MongoDB, and go. This guide explains why a fundamentally different approach is necessary and what your options are.

A Quick History Lesson

Enter MongoDB

Enter MongoDB in the late 2000s. Terms like schema-free, semi-structured data and JSON become the norm, while Entity Relationship Diagrams and DDLs become less prominent. The Three Vs defined by Gartner (Volume, Variety and Velocity) start to define the purpose of NoSQL vendors compared to their older relational cousins.

MongoDB and the new generation of databases bring great improvements with them:

- Improved developer productivity
- Massive scalability
- Unheard-of reliability
- Great performance by any standard

Not to mention they're typically open source and, when compared to big-name vendors, can be had for pennies on the dollar for production support. But with these advancements comes a new way of thinking about your data, and the need for a new approach to analytics.

Data Models? What Data Models?

MongoDB requires no actual schema, but that does not imply the developer and the DBA don't need to coordinate on the best approach to a long-term solution.

The advantages of NoSQL databases are often simply the result of saving and accessing data in new ways. For MongoDB that means storing complete objects in one compact area of disk (or memory, or CPU cache, or ...). From a developer's perspective, however the object looks in memory is exactly how it looks stored in MongoDB. It's physically stored as a single document in a binary form of JSON referred to as BSON.

Why? Consider this: it takes much less time and effort to load a truck full of parts from a single store and deliver it to a customer than to have a truck (or multiple trucks) pick up parts from multiple stores before delivery to the same customer. The same goes for anything in our physical reality: it takes real energy and real clock cycles to perform X actions. If everything is kept in a single location there is only one action to perform, versus X multiplied by the number of relational tables the object is stored in.

Many NoSQL databases, especially MongoDB, prefer to store data as complete objects in one location rather than as discrete, normalized, small bits of information in a dozen or even a hundred different locations (tables). This is often the most difficult aspect to accept and understand when moving from relational databases to MongoDB. Remember: not only is the data stored in the same location, it is stored in a completely different format.

Making the transition to MongoDB often finds developers and DBAs asking questions such as:

- How do I join this data?
- How do I sub-select?
- What key do I use?
- How do I create a 1/N/M to 1/N/M relationship here?
- How can one document (row) be completely different from the next?
- Why can't I get a standard schema description?

Those questions aren't as important, and sometimes don't apply anymore, because, remember, all the data for a single object can be stored together.

Consider an example. Say there is a database with two tables that store information about books and authors. In a relational database this would be easy to set up: simple creation, simple key relationship. This could work for a small or medium-sized data set without much change. It supports 1:1 (one-to-one) and 1:M (one-to-many) relationships. Adding a relationship table could add support for M:M (many-to-many) relationships.

In JSON (MongoDB's document format of choice), the same data could be stored as a single author document, with that author's books embedded in it as an array.
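As a minimal sketch of the two models, assuming hypothetical table names, field names and sample values (the text does not specify any), the relational layout and the single-document layout might look like this in Python. An in-memory SQLite database stands in for the relational side, and a plain dictionary stands in for the JSON document.

```python
import json
import sqlite3

# Relational model: two tables linked by a key (all names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        isbn TEXT PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Jane Doe');
    INSERT INTO books VALUES ('1234-001', 'First Book', 1);
    INSERT INTO books VALUES ('5678-002', 'Second Book', 1);
""")

# Reading an author together with their books requires a JOIN.
rows = conn.execute("""
    SELECT a.name, b.isbn, b.title
    FROM authors a JOIN books b ON b.author_id = a.id
    WHERE a.id = 1
    ORDER BY b.isbn
""").fetchall()

# Document model: the same data as one self-contained author document.
author_doc = {
    "_id": 1,
    "name": "Jane Doe",
    "books": [
        {"isbn": "1234-001", "title": "First Book"},
        {"isbn": "5678-002", "title": "Second Book"},
    ],
}

print(rows)
print(json.dumps(author_doc, indent=2))
```

The relational read touches two tables, while the document already carries everything about the author in one place.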

Retrieving this author document from the database also fetches the books the author wrote, because those are also part of the document. This is great because it saves time by reading all the related data in a single action. But what about the flip side: what if the user wanted to find out more about the author based on a book instead?

This is an important thing to note, and something MongoDB teaches their employees: the data model should be created based on expected query patterns from applications, not what the developer finds most convenient. This is the only way to ensure performance at scale.

As an example, assume a user searches for all books with an ISBN starting with 1234. While the data model above lends itself well to finding an author and his or her books, and makes sense in a developer's object-oriented brain, it won't perform well when searching for books by ISBN. Why? Because MongoDB has to look through each author document, then look at each array inside that document, to know whether it matches the user's query.

An index could be added to the books array, which might help, but that adds another index MongoDB needs to write to when adding, updating or deleting data. Also be careful about adding indexes to arrays with MongoDB; it can be a real problem down the road.

Another option is to store book information in a separate collection and then refer to it in a second query or with the $lookup stage of a MongoDB aggregation. Unfortunately, that approach negates the performance benefit of avoiding JOINs in the first place. This is why the data model must be designed well in advance.
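The cost of the ISBN search can be sketched in plain Python. The sample documents, the field names (`books`, `isbn`, `author_id`) and the collection name `authors` are assumptions made for illustration; the query filter and the pipeline are written in MongoDB's query syntax as Python dictionaries but are not executed here.

```python
# Hypothetical author documents with embedded books.
authors = [
    {"name": "Jane Doe", "books": [
        {"isbn": "1234-001", "title": "First Book"},
        {"isbn": "5678-002", "title": "Second Book"},
    ]},
    {"name": "John Roe", "books": [
        {"isbn": "1234-777", "title": "Other Book"},
    ]},
]


def books_by_isbn_prefix(author_docs, prefix):
    """Mimic an unindexed search: visit every document and every array item."""
    matches = []
    for author in author_docs:            # every author document is read
        for book in author["books"]:      # and every embedded book is checked
            if book["isbn"].startswith(prefix):
                matches.append((author["name"], book["title"]))
    return matches


# Filter for the embedded model. It matches whole author documents that
# contain at least one such book, not the individual books.
embedded_filter = {"books.isbn": {"$regex": "^1234"}}

# Alternative model: books in their own collection, joined back to authors
# with $lookup (assumes each book document carries an author_id field).
lookup_pipeline = [
    {"$match": {"isbn": {"$regex": "^1234"}}},
    {"$lookup": {
        "from": "authors",
        "localField": "author_id",
        "foreignField": "_id",
        "as": "author",
    }},
]

print(books_by_isbn_prefix(authors, "1234"))
```

The nested loops make the trade-off visible: the embedded model is cheap when reading by author and expensive when reading by book, while the $lookup variant reintroduces a join.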

The Big Question: What Do I Do For Analytics on NoSQL?

This is where you probably are, and what led you to reading this article.

The Question: How do you use existing solutions to analyze live, deeply nested, semi-structured, schema-free data in MongoDB?

The Answer: You don't. It's not possible. At all.

What is needed is a way to analyze this new data format that is both obvious to use and natively understands the nested, schema-less data. Even when it changes. On the fly. No tools that exist for relational databases can do this. Not one. So then, what are the options?

Option #1: Custom Coding

You've built a really cool app using MongoDB. That's awesome! MongoDB provides drivers for every major programming language, with solid documentation. They've lowered the barrier to entry for developers, which is great. Developers can write apps quickly and use the JSON model to rapidly prototype ideas.

It's easy to shove data into MongoDB. Getting simple data out isn't terribly hard, usually. Getting meaningful data for making business decisions can be a different story. Many readers will relate to this; it's a major reason this article was written.

You owe it to yourself to read this article by our CTO, John De Goes. It discusses the difficulties around creating dashboards geared toward customers and decision makers, all while avoiding the money pit that comes with custom-coded reporting.

Pros

- Maximum flexibility

Cons

- Everything is custom
- Significant MongoDB knowledge required
- Long-term support of the custom solution
- Significant investment of time
- One-off solution for MongoDB

Consider this approach when you:

- Have employees on the bench
- Already have deep knowledge of MongoDB aggregation and map-reduce functions
- Understand third-party visualization tool integration
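To give a feel for what custom coding involves, here is a small sketch of a report that counts books per author. The pipeline uses MongoDB's aggregation stage syntax; the collection contents and field names are hypothetical, and the Python function is only a local stand-in that mirrors the pipeline's stages so the example runs without a database.

```python
from collections import Counter

# Hypothetical aggregation pipeline: number of books per author.
pipeline = [
    {"$unwind": "$books"},
    {"$group": {"_id": "$name", "book_count": {"$sum": 1}}},
    {"$sort": {"book_count": -1}},
]


def books_per_author(author_docs):
    """Pure-Python stand-in for the pipeline above, for illustration only."""
    counts = Counter()
    for author in author_docs:
        for _book in author.get("books", []):  # $unwind: one item per book
            counts[author["name"]] += 1         # $group with {"$sum": 1}
    return counts.most_common()                 # $sort by count, descending


sample_authors = [
    {"name": "John Roe", "books": [{"isbn": "1234-777"}]},
    {"name": "Jane Doe", "books": [{"isbn": "1234-001"}, {"isbn": "5678-002"}]},
]

print(books_per_author(sample_authors))
```

With a driver such as PyMongo, the pipeline would typically be passed to a collection's `aggregate` method. Every new report means another pipeline like this one, written and maintained by hand.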

Option #2: ETL

If you must use an existing two-dimensional (relational) database reporting tool with your multi-dimensional (non-relational) database, this is the only option. There is simply no way to get a relational reporting tool to read, understand or display MongoDB data directly.

The Extract-Transform-Load approach has been used for a very long time with relational databases. I won't go into the details, since there are better (and longer) articles elsewhere and you've likely investigated this route already. I will, however, give an example of why this approach is very difficult to implement from a technical perspective.

Many people use MongoDB's ability to leverage the schema itself as data. What does this mean? It means you can use the field name (or column name, in relational terms) as part of the data. Consider, as an example, a JSON document whose fields are named after the quarters they describe.
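A hypothetical document of this kind might look like the sketch below. Only the quarter-style field names come from the text; the `_id`, the `sales` wrapper and the metric names are invented for illustration.

```python
import json

# Hypothetical document: the field names under "sales" are themselves data.
doc = {
    "_id": "acme",
    "sales": {
        "2016-Q1": {"revenue": 120000, "units": 340},
        "2016-Q2": {"revenue": 150000, "units": 410},
    },
}

# Which quarters exist can only be discovered by reading the keys.
quarters = sorted(doc["sales"].keys())

print(json.dumps(doc, indent=2))
print(quarters)
```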

Notice the field names 2016-Q1 and 2016-Q2? We didn't list the field name as quarter and the field value as 2016-Q1. The field name is the value, and the remainder of the field value contains even more information.

An ETL process takes rich, nested and self-described data and forces it into small, rigidly typed containers so it can be reported on in a confined, rigidly typed way. Data fidelity will be lost during conversion with this kind of data.

When considering an ETL approach, readers must consider these points:

- How will deeply nested arrays be mapped to two-dimensional tables?
- How will documents in the same collection, but with different schemas, be mapped?
- How will MongoDB schema changes be handled?
- Can the ETL solution handle the volume, variety and velocity of data?
- Will the solution scale to include new MongoDB applications and data?
- How fresh is the data from the ETL process?

Bring your attention back to the first bullet point. Someone will need to manually map the MongoDB data model to the relational ETL model. This will need to happen again whenever the schema changes, and with MongoDB the schema can change frequently. While scripts can be written, and some very basic tools exist that can handle the most rudimentary parts of this, the fact is that the vast majority of the data model can only be mapped by a human. Again, see the example document above.
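The manual mapping step can be sketched as follows. The document shape, the field names and the fixed target columns are all assumptions for illustration; the point is that a hand-written mapping only carries over the columns someone planned for.

```python
# Hypothetical source document: quarter names are keys, and Q2 has gained
# a new "returns" field that the mapping below was never told about.
doc = {
    "_id": "acme",
    "sales": {
        "2016-Q1": {"revenue": 120000, "units": 340},
        "2016-Q2": {"revenue": 150000, "units": 410, "returns": 12},
    },
}

# The relational target has a fixed set of columns.
TARGET_COLUMNS = ("company", "quarter", "revenue", "units")


def flatten_sales(document):
    """Turn each quarter key into a row with a 'quarter' column."""
    rows = []
    for quarter, metrics in sorted(document["sales"].items()):
        rows.append({
            "company": document["_id"],
            "quarter": quarter,                 # field name becomes a value
            "revenue": metrics.get("revenue"),
            "units": metrics.get("units"),
            # anything not in TARGET_COLUMNS, such as "returns", is dropped
        })
    return rows


rows = flatten_sales(doc)
for row in rows:
    print(row)
```

The new `returns` field never reaches the relational side until someone notices it and extends both the mapping and the target table, which is the loss of fidelity and the maintenance burden described above.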

Pros

- Most two-dimensional reporting tools can read the transformed data model

Cons

- Weeks or months of work to set up
- Cannot adjust to documents with different schemas
- Part-time or full-time DBA to maintain
- Significant investment in hardware, employee time and process
- Loss of data fidelity
- Analysis is not live

Consider this approach when you:

- Have already heavily invested in legacy reporting tools
- Are expected to use existing tools
- Can accept loss of data fidelity

Option #3: Native NoSQL Analytics with SlamData

Native NoSQL analytics is a completely different approach to analytics. Given the document models shown earlier, it's easy to see the disconnect that legacy relational reporting tools have.

Any solution designed for multi-dimensional NoSQL analytics must be designed and built for this from the ground up. Once a solution for multi-dimensional data is developed, it's possible to then go back and apply it to two-dimensional data. It doesn't work the other way around.

Unfortunately for existing BI vendors, the ability to natively work with MongoDB is not something that can be bolted on or included in a new version of an existing product. The best they can hope for is to mimic the ETL option. Make no mistake: reporting tools designed for relational databases will not analyze live, nested MongoDB data. Even MongoDB's official BI Connector performs an ETL process and stores data in PostgreSQL for two-dimensional analytics.

SlamData consists of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB.

The SlamData product is a single analytics solution for business analysts, data scientists, developers, data architects and DBAs working with MongoDB. It natively understands dynamic, nested data and provides an interface built for it.

All actions performed by SlamData occur on live data. Commands are sent to MongoDB in the most performant order based on the user's requested search. MongoDB performs 100% of the computation and only the results are returned to SlamData. This is a key difference to understand: with ETL and existing BI tools, an entire table (typically many tables) is returned and the solution must then perform analytics on the entire data set.

With SlamData, custom analytical workflows can be created by adding discrete action cards on top of one another. This allows actions such as querying MongoDB with SQL and displaying tabular reports, graphical charts, interactive forms and more. Cards can be stacked based on whatever the user is trying to accomplish.

- Developers can dynamically pass values into workspaces to control content and flow.
- DBAs can easily view schema and data.
- Business analysts can use standard SQL queries against MongoDB nested data.
- Users can interact with forms that allow self-service.

All workflows, or any part of a workflow, can be securely embedded into other applications.

Users can install SlamData and create dashboards on live MongoDB data in less than 60 minutes, regardless of schema.

SlamData runs on Linux, OS X and Windows. It can run on laptops, workstations, or as a server, on bare metal or virtualized. Since SlamData pushes 100% of its queries directly to the database for processing, there's no need for massive data transfers or heavy system requirements. The more optimized a MongoDB architecture is, the better SlamData runs. SlamData connects to any MongoDB database, including remote instances and SSL-encrypted deployments.

Pros

- Immediate ROI
- Create embeddable reports within minutes of installing
- Natively view, analyze and display deeply nested, semi-structured data
- Use an enhanced SQL dialect that works on both relational and NoSQL data, instead of learning MongoDB's multiple proprietary approaches
- Graphically lay out interactive forms, reports and charts
- Provide Google-like search functionality to MongoDB for end users
- Export data in multiple formats for custom processing
- Restrict data visibility and actions based on the user authorization model
- Enterprise-grade multi-tenant security

Cons

- Learning a new BI tool and a new approach to MongoDB analytics
- Explaining the importance of this approach to non-technical management
- Not as flexible as custom coding

There Is No Magic; It's Algebra And It's Open Source

A Completely New Technology. 100% Open Source. 100% Scalable.

Our co-founders are sometimes heard saying something like, "There is no magic to SlamData, it's all there for the world to see." That's both true and a little misleading.

This type of solution isn't written overnight, or in a few months. It takes years to fundamentally change the way multi-dimensional data is modeled, understood and presented. It takes an engineering team skilled in database technologies, mathematics, analytics, and advanced software development patterns to create a long-term, comprehensive solution.

So while our code can be checked out and modified on GitHub, that doesn't mean every developer who clones it will understand it. It may, in fact, look like magic. Arthur C. Clarke's Third Law comes to mind here. It's advanced. Don't believe me? Check it out for yourself.

What The Market Wants, What The Market Needs

We're good at MongoDB analytics. Really good. But we're not stopping there. We haven't spent years developing this to provide an amazing solution for just MongoDB. SlamData is the sole company that has built technology that bridges all of your data.

All data sources, one analytics solution. Picture this for a moment:

- Relational databases? Check.
- NoSQL databases? Check.
- XML, JSON and other nested flat file formats? Check.
- Cross-datasource (federated) queries and joins? Check.
- Query and display log data, relational data and NoSQL data at the same time? Check.
- Pivot tables and multi-dimensional data structure viewers? Check.
- Open source? Check.
- One platform to query and analyze all data sources in your company? Check.

While I'm writing this, we have a team of engineers writing connectors for several other databases that will be included in SlamData version 3.1. With our QScript connector technology we can create a connector for any data source (database, file, API) in a matter of weeks. You can expect several new data sources to be supported in each major release of SlamData, all with the same functionality that we currently provide for MongoDB.

With its ability to use standard SQL² across various data sources simultaneously, out-of-the-box visualizations, one-click embedding, customizable analytics workflows, enterprise-grade security, multi-tenant hosting capabilities, virtual views, interactive forms, 100% in-database query execution, and more, SlamData is the only sensible approach for database analytics today.

© 2016 SlamData