The Definitive Guide to MongoDB Analytics
Analytics on MongoDB is a different beast than what you re familiar with. Don t expect to fire up your existing analytics tool, point it at MongoDB, and go. This guide explains why a fundamentally different approach is necessary and what your options are.
A Quick History Lesson This is where you probably are, and what led you to reading this article. The Question How do you use existing solutions to analyze live, deeply nested, semi-structured, schema-free data in MongoDB? Everything Has Failed How do you use existing solutions to analyze live, deeply nested, semistructured, schema-free data in MongoDB? The Answer You don t. It s not possible. At all. What is needed is a way to analyze this new data format in a way that is both obvious to use but also natively understands the nested, schema-less data. Even when it changes. On the fly. No tools that exist for relational databases can do this. Not one. So then, what are the options?
Enter MongoDB+ Enter MongoDB in the late 2000 s. Terms like schema free, semistructured data and JSON become the norm while Entity Relationship Diagrams and DDLs become less prominent. The Three Vs defined by Gartner (Volume, Variety and Velocity) are starting to define the purpose of NoSQL vendors compared to their older relational cousins. New Databases, Great Improvements, But MongoDB and the new generation of databases brings with them great improvements but also require a new approach to analytics. MongoDB and the new generation of databases brings with them great improvements: Improved developer productivity Massive scalability Unheard of reliability Great performance by any standard Not to mention they re typically open source and, when compared to big-name vendors, can be had for pennies on the dollar for production support. But with these advancements comes a new way of thinking about your data.
Data Models? What Data Models?...No actual schema does not imply that the developer or DBA doesn t need to coordinate on the best approach... MongoDB requires no actual schema but that does not imply the developer and the DBA don t need to coordinate on the best approach to a long term solution. The advantages of NoSQL databases are often simply the result of saving and accessing data in new ways. For MongoDB that means storing complete objects in one compact area of disk (or memory, or CPU cache, or ). From a developer s perspective, however the object looks in memory is exactly how it looks stored in MongoDB. It s physically stored as a single document in a binary form of JSON referred to as BSON. Why? Consider this: it takes much less time and effort to load a truck full of parts from a single store and deliver it to a customer, than to have a truck (or multiple trucks) pick up parts from multiple stores before delivery to the same customer. The same goes for anything in our physical reality: it takes real energy and real clock cycles to perform X actions. If everything is kept in a single location there is only 1 action to perform, versus X multiplied by the number of relational tables the object is stored in. Many NoSQL databases, especially MongoDB, prefer to store data as complete objects in one location rather than discrete, normalized, small bits of information in a dozen or even a hundred different locations (tables). This is often the most difficult aspect to accept and understand when moving from relational databases to MongoDB. Remember: Not only is data stored in the same location, it is stored in a completely different format. Making the transition to MongoDB often finds developers and DBAs asking questions such as: How do I join this data? How do I sub-select? What key do I use?
Data Models? What Data Models? (cont d.) How do I create a 1/N/M to 1/N/M relationship here? How can one document (row) be completely different from the next? Why can t I get a standard schema description? Those questions aren t as important, and sometimes don t apply anymore, because remember: all the data for a single object can be stored together. Check out at the example on the next page. Say there is a database with two tables that store information about books and authors. In a relational database it would be easy to setup: Simple creation. Simple key relationship. This could work for a small or medium-sized data set without much change. It supports 1:1 (one-to-one) and 1:M (one-to-many) relationships. Adding a relationship table could add support for M:M (many-to-many) relationships. In JSON (MongoDB s document format of choice), the following data structure could be used:
Data Models? What Data Models? (cont d.) Retrieving this author document from the database also fetches the books the author wrote because those are also part of the document. This is great because it saves time by reading all the related data in a single action. But what about the flip side: what if the user wanted to find out more about the author based on a book instead? This is an important thing to note, and something MongoDB teaches their employees: the data model should be created based on expected query patterns from applications, not what the developer finds most convenient. This is the only way to ensure performance at scale.
Data Models? What Data Models? (cont d.) As an example: assume a user searches for all books with an ISBN starting with 1234. While the data model above lends itself well to finding an author and his or her books, and makes sense in a developer s object-oriented brain, it won t perform well searching for books by ISBN. Why? Because MongoDB has to look through each author document, then look at each array inside of that document to know whether it matches the user s query. An index could be added to the books array, which might help, but now adds another index MongoDB needs to write to when adding, updating or deleting data. Also be careful about adding indexes to arrays with MongoDB it can be a real problem down the road. Another option is to store book information in a separate collection and then refer to it in a second query or with a $lookup MongoDB aggregation command. Unfortunately that approach negates the performance benefits of avoiding JOINs originally. This is why the data model must be designed well in advance. MongoDB requires no actual schema but that does not imply the developer and the DBA don t need to coordinate on the best approach to a long term solution.
The Big Question: What Do I Do For Analytics on NoSQL? This is where you probably are, and what led you to reading this article. The Question How do you use existing solutions to analyze live, deeply nested, semi-structured, schema-free data in MongoDB? Everything Has Failed How do you use existing solutions to analyze live, deeply nested, semistructured, schema-free data in MongoDB? The Answer You don t. It s not possible. At all. What is needed is a way to analyze this new data format in a way that is both obvious to use but also natively understands the nested, schema-less data. Even when it changes. On the fly. No tools that exist for relational databases can do this. Not one. So then, what are the options?
Option #1: Custom Coding You ve built a really cool app using MongoDB. That s awesome! MongoDB provides drivers for every major programming language with solid documentation. They ve lowered the barrier of entry for developers to get started which is great. Getting simple data out of MongoDB isn t terribly hard. Usually. Getting meaningful data for making business decisions can be a different story. Many readers will relate to this it s a major reason this article was written. Developers can write apps quickly and utilize the JSON model to rapidly prototype ideas. It s easy to shove data into MongoDB. Getting simple data out isn t terribly hard, usually. Getting meaningful data for making business decisions can be a different story. Many readers will relate to this it s a major reason this article was written. You owe it to yourself to read this article by our CTO, John De Goes. It discusses the difficulties around creating dashboards geared toward customers and decision makers, all while avoiding the money pit that comes with custom-coded reporting. Pros Maximum flexibility Cons Everything is custom Significant MongoDB knowledge required Long term support of the custom solution Significant investment of time One-off solution for MongoDB Consider this approach when you: Have employees on the bench Already have deep knowledge of MongoDB aggregation and mapreduce functions Understand third party visualization tool integration
Option #2: ETL If you must use an existing two-dimensional (relational) database reporting tool with your multi-dimensional (nonrelational) database, this is the only option. There is simply no way to get a relational reporting tool to read, understand or display MongoDB data. Old Tools Don t Adapt. There is simply no way to get a relational reporting tool to read, understand or display MongoDB data. The Extract-Transform-Load approach has been used for a very long time with relational databases. I won t go into the details since there are better (and longer) articles elsewhere, and you ve likely investigated this route already. I will, however, give an example of why this approach is very difficult to implement from the technical perspective. Many people use MongoDB s ability to leverage the schema itself as data. What does this mean? It means you can use the field name (or column name in relational terms) as part of the data. Take the following JSON document as an example:
Option #2: ETL (cont d.) Notice the field names 2016-Q1 and 2016-Q2? We didn t list the field name as quarter and the field value as 2016-Q1. The field name is the value, and the remainder of the field value contains even more information. An ETL process takes rich, nested and self-described data and forces it into small, rigidly-typed containers so it can report in a confined, rigidly-typed way. Data fidelity will be lost during conversion with this kind of data. When considering an ETL approach readers must consider these points: How will deeply nested arrays be mapped to twodimensional tables? How will documents in the same collection, but with different schemas, be mapped? How will MongoDB schema changes be handled? Can the ETL solution handle the volume, variety and velocity of data? Will the solution scale to include new MongoDB applications and data? How fresh is the data from the ETL process? Bring your attention back to the first bullet point. Someone will need to manually map the MongoDB data model to the relational ETL model. This will need to happen again whenever the schema changes, and with MongoDB, the schema can change frequently. While scripts can be written, and some very basic tools exist that can handle the most rudimentary parts of this, the fact is that the vast majority of the data model can only be mapped by a human. Again, see the example document above.
Option #2: ETL (cont d.) Pros Most two-dimensional reporting tools can read the transformed data model. Cons Weeks or months of work to set up Cannot adjust to documents with different schemas Part-time or full-time DBA to maintain Significant investment in hardware, employee time and process Loss of data fidelity Accept that analysis is not live Consider this approach when you: Have already heavily invested in legacy reporting tools Are expected to use existing tools Can accept loss of data fidelity
Option #3: Native NoSQL Analytics with SlamData Native NoSQL analytics is a completely different approach to analytics. Given the document models shown earlier it s easy to see the disconnect that legacy relational reporting tools have. A Modern Tool for Modern Data SlamData is comprised of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB. Any solution designed for multidimensional, NoSQL analytics must be designed and built for this from the ground up. Once a solution for multi-dimensional data is developed it s possible to then go back and apply it to two-dimensional data. It doesn t work the other way around. Unfortunately for existing BI vendors, the ability to natively work with MongoDB is not something that can be bolted on or included in a new version of an existing product. The best they can hope for is to mimic the ETL option. Make no mistake: reporting tools designed for relational databases will not analyze live, nested MongoDB data. Even MongoDB s official BI Connector performs an ETL process and stores data in PostgreSQL for two-dimensional analytics. SlamData is comprised of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB. The SlamData product is a single analytics solution for business analysts, data scientists, developers, data architects and DBAs working with MongoDB. It natively understands dynamic, nested data and provides an interface built for it. All actions performed by SlamData occur on live data. Commands are sent to MongoDB in the most performant order based on the user s requested search. MongoDB
Option #3: Native NoSQL Analytics with SlamData Native NoSQL analytics is a completely different approach to analytics. Given the document models shown earlier it s easy to see the disconnect that legacy relational reporting tools have. A Modern Tool for Modern Data SlamData is comprised of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB. Any solution designed for multidimensional, NoSQL analytics must be designed and built for this from the ground up. Once a solution for multi-dimensional data is developed it s possible to then go back and apply it to two-dimensional data. It doesn t work the other way around. Unfortunately for existing BI vendors, the ability to natively work with MongoDB is not something that can be bolted on or included in a new version of an existing product. The best they can hope for is to mimic the ETL option. Make no mistake: reporting tools designed for relational databases will not analyze live, nested MongoDB data. Even MongoDB s official BI Connector performs an ETL process and stores data in PostgreSQL for two-dimensional analytics. SlamData is comprised of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB. The SlamData product is a single analytics solution for business analysts, data scientists, developers, data architects and DBAs working with MongoDB. It natively understands dynamic, nested data and provides an interface built for it. All actions performed by SlamData occur on live data. Commands are sent to MongoDB in the most performant order based on the user s requested search. MongoDB
Option #3: Native NoSQL Analytics with SlamData (cont d.) performs 100% of the computation and only the results are returned to SlamData. This is a key difference to understand: with ETL and existing BI tools an entire table (typically many tables) are returned and the solution must then perform analytics on the entire data set. A Modern Tool for Modern Data SlamData is comprised of two primary pieces: the SlamData web application and the Quasar analytics engine. Both were designed from the beginning, in tandem, to understand and interact with NoSQL data like MongoDB. With SlamData custom analytical workflows can be created by adding discrete action cards on top of one another. This allows actions such as querying MongoDB with SQL, displaying tabular reports, graphical charts, interactive forms and more. Cards can be stacked based on whatever the user is trying to accomplish. Developers can dynamically pass values into workspaces to control content and flow. DBAs can easily view schema and data. Business analysts can use standard SQL queries against MongoDB nested data. Users can interact with forms that allow self-service. All workflows, or any part of a workflow, can be securely embedded into other applications.
Option #3: Native NoSQL Analytics with SlamData (cont d.) Users can install SlamData and create dashboards on live MongoDB data in less than 60 minutes, regardless of schema. SlamData runs on Linux, OS X and Windows. It can also run on laptops, workstations, or as a server. It can run on bare metal or virtualvized. Since SlamData pushes 100% of its queries directly to the database for processing, there s no need for massive data transfers or heavy system requirements. The more optimized a MongoDB architecture is, the better SlamData runs. SlamData connects to any MongoDB database including remote instances and SSL-encrypted deployments too. Pros Immediate ROI
Option #3: Native NoSQL Analytics with SlamData (cont d.) Create embeddable reports in minutes after install Natively view, analyze and display deeply nested, semistructured data Use an enhanced SQL dialect that works on both relational and NoSQL data, instead of learning MongoDB s multiple proprietary approaches Graphically layout interactive forms, reports and charts Provide Google-like search functionality to MongoDB for end users Export data in multiple formats for custom processing Restrict data visibility and actions based on user authorization model Enterprise-grade multi-tenant security Cons Learning a new BI tool and new approach to MongoDB analytics Explaining the importance of this approach to nontechnical management Not as flexible as custom coding
There Is No Magic; It s Algebra And It s Open Source Our co-founders are sometimes heard saying something like There is no magic to SlamData, it s all there for the world to see. That s both true and a little misleading. A Completely New Technology. 100% Open Source. 100% Scalable. It takes an engineering team skilled in database technologies, mathematics, analytics, and advanced software development patterns to create a long-term, comprehensive solution. This type of solution isn t written overnight, or in a few months. It takes years to fundamentally change the way multidimensional data is modeled, understood and presented. It takes an engineering team skilled in database technologies, mathematics, analytics, and advanced software development patterns to create a long-term, comprehensive solution. So while our code can be checked out and modified on GitHub, it doesn t mean every developer who clones it will understand it. It may, in fact, look like magic. Arthur C. Clarke s Third Law comes to mind here. It s advanced. Don t believe me? Check it out for yourself.
What The Market Wants, What The Market Needs We re good at MongoDB Analytics. Really good. But we re not stopping there. We haven t spent years developing this to provide an amazing solution for just MongoDB. SlamData is the sole company that has built technology that bridges all of your data. All data sources, one analytics solution. Picture this for a moment: All data sources, one analytics solution. Picture this for a moment: All data sources, one analytics solution. Relational databases? Check. NoSQL databases? Check. XML, JSON and other nested flat file formats? Check. Cross-datasource (federated) queries and joins? Check.
What The Market Wants, What The Market Needs Relational databases? Check. NoSQL databases? Check. XML, JSON and other nested flat file formats? Check. Cross-datasource (federated) queries and joins? Check. Query and display log data, relational data and NoSQL data at the same time? Check. Pivot Tables and multidimensional data structure viewers? Check. Open Source? Check. One platform to query and analyze all data sources in your company? Check. While I m writing this we have a team of engineers writing connectors for several other databases that will be included in SlamData version 3.1. With our QScript connector technology we can create a connector for any data source (database, file, API) in a matter of weeks. You can expect several new data sources to be supported in each major release of SlamData. All with the same functionality that we currently provide for MongoDB. With it s ability to use standard SQL² across various data sources simultaneously, out-of-the-box visualizations, one-click embedding, customizable analytics workflows, enterprisegrade security, multi-tenant hosting capabilities, virtual views, interactive forms, 100% in-database query execution, and more SlamData is the only sensible approach for database analytics today.
2016 Slamdata