Time-Series Data in MongoDB on a Budget. Peter Schwaller, Senior Director, Server Engineering, Percona. Santa Clara, California, April 23rd-25th, 2018


TIME SERIES DATA in MongoDB on a Budget

What is Time-Series Data?
Characteristics:
- Arriving data is stored as a new value, as opposed to overwriting existing values
- Usually arrives in time order
- Accumulated data size grows over time
- Time is the primary means of organizing/accessing the data

Time Series Data in MONGODB on a Budget

Why MongoDB?
- General-purpose database
- Specialized time-series DBs do exist
- Do not use the mmap storage engine

Data Retention Options
- Purge old entries
  - Set up a MongoDB index with the TTL option (be careful if this index is your shard key)
- Aggregate data and store summaries
  - Create a summary document, delete the original raw data
  - Huge compression possible (seconds -> minutes -> hours -> days -> months -> years)
- Measurement buckets
  - Store all entries for a time window in a single document
  - Avoids storing duplicate metadata
- Individual documents for each measurement
  - Useful when data is sparse or intermittent (e.g., events rather than sensors)
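As an illustration of the purge and bucket options above, here is a minimal PyMongo sketch. The connection string, database, collection, and field names (measurements, time, readings) are assumptions for the example, not anything prescribed by the talk, and the two options are independent alternatives shown against one collection only for brevity.

    from datetime import datetime
    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017")["DBName"]["measurements"]  # assumed names

    # Purge option: a TTL index drops documents 30 days after their 'time' value.
    coll.create_index("time", expireAfterSeconds=30 * 24 * 3600)

    # Bucket option: one document per sensor per hour; readings are appended to
    # an array so per-measurement metadata is not repeated.
    now = datetime.utcnow()
    bucket_id = {"sensor": "sensor-42", "hour": now.replace(minute=0, second=0, microsecond=0)}
    coll.update_one(
        {"_id": bucket_id},
        {"$push": {"readings": {"t": now, "v": 98.6}}, "$inc": {"count": 1}},
        upsert=True,
    )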

Potential Problems with Data Collection
- Duplicate entries
  - Utilize a unique index in MongoDB to reject duplicate entries
- Delayed entries
- Out-of-order entries
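A short sketch of the unique-index approach to rejecting duplicates, again with assumed collection and field names (source, time):

    from datetime import datetime
    from pymongo import ASCENDING, MongoClient
    from pymongo.errors import DuplicateKeyError

    coll = MongoClient()["DBName"]["measurements"]  # assumed names

    # A compound unique index makes a re-delivered measurement fail on insert.
    coll.create_index([("source", ASCENDING), ("time", ASCENDING)], unique=True)

    try:
        coll.insert_one({"source": "sensor-42", "time": datetime.utcnow(), "value": 98.6})
    except DuplicateKeyError:
        pass  # same source + timestamp already stored; safe to ignore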

Problems with Delayed and Out-of-Order Entries
- Alert/event generation
- Incremental backup

Enable Streaming of Data
- Add a recordedTime field (in addition to the existing field with the measurement timestamp)
- Utilize the $currentDate feature of db.collection.update(): $currentDate: { recordedTime: true }
- You cannot use this field as a shard key!
- Requires use of update instead of insert
  - Which in turn requires specification of the _id field
  - Consider constructing your _id to solve the duplicate-entries issue at the same time
- Allows applications to reliably process each document once and only once
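A minimal PyMongo sketch of the pattern above: the _id is built from the source and measurement timestamp (so duplicate deliveries overwrite themselves), and $currentDate stamps the server-side arrival time into recordedTime. Apart from $currentDate itself, the names here are illustrative assumptions.

    from datetime import datetime
    from pymongo import MongoClient

    coll = MongoClient()["DBName"]["measurements"]  # assumed names

    measurement = {"source": "sensor-42", "time": datetime.utcnow(), "value": 98.6}
    # _id derived from source + measurement time also handles duplicate entries.
    doc_id = f"{measurement['source']}:{measurement['time'].isoformat()}"

    coll.update_one(
        {"_id": doc_id},
        {"$set": measurement,
         "$currentDate": {"recordedTime": True}},  # server assigns the arrival time
        upsert=True,
    )

A consumer can then stream reliably by repeatedly querying {"recordedTime": {"$gt": last_seen_recorded_time}}.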

Accessing Your Data
It's only *mostly* write-only.

Create Appropriate Indexes
- Avoid collection scans! Consider using: db.adminCommand( { setParameter: 1, notablescan: 1 } )
- Avoid queries that might as well be collection scans
- Create the indexes you need (but no more)
  - Don't depend on index intersection
- Don't over-index
  - Each index can take up a lot of disk/memory
  - Consider using partial indexes: { partialFilterExpression: { speed: { $gt: 75.0 } } }
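The two snippets on this slide, expressed as a runnable PyMongo sketch; the speed field and collection name are placeholders, and notablescan is a real server parameter but only sensible on development or test servers.

    from pymongo import ASCENDING, MongoClient

    client = MongoClient()
    coll = client["DBName"]["measurements"]  # assumed names

    # Make the server reject any query that would require a collection scan.
    client.admin.command("setParameter", 1, notablescan=True)

    # Partial index: only documents with speed > 75.0 are indexed, so the index
    # stays small when most documents would never match the query anyway.
    coll.create_index(
        [("speed", ASCENDING)],
        partialFilterExpression={"speed": {"$gt": 75.0}},
    )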

Check Your Indexes
- Use .explain() liberally
- Check which indexes are actually used: db.collection.aggregate( [ { $indexStats: {} } ] )
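A small sketch of both checks in PyMongo; the field names in the example query are assumptions.

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    coll = MongoClient()["DBName"]["measurements"]  # assumed names

    # explain(): confirm the winning plan is an index scan, not a COLLSCAN.
    start = datetime.utcnow() - timedelta(hours=1)
    plan = coll.find({"source": "sensor-42", "time": {"$gte": start}}).explain()
    print(plan["queryPlanner"]["winningPlan"])

    # $indexStats: per-index usage counters; indexes with ops == 0 are candidates
    # for removal.
    for stats in coll.aggregate([{"$indexStats": {}}]):
        print(stats["name"], stats["accesses"]["ops"])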

Adding Data
Getting the Speed You Need

API Methods
- Insert array:
    database[collection].insert(doc_array)
- Insert unordered bulk:
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.insert(doc)  # loop here
    bulk.execute()
- Upsert unordered bulk:
    bulk = database[collection].initialize_unordered_bulk_op()
    bulk.find({"_id": doc["_id"]}).upsert().update_one({"$set": doc})  # loop here
    bulk.execute()
- Insert single:
    database[collection].insert(doc)
- Upsert single:
    database[collection].update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
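The bulk builder shown above is the legacy PyMongo 2.x/3.x API; current PyMongo expresses the same unordered bulk upsert with bulk_write(). A minimal sketch, with assumed document contents and names:

    from pymongo import MongoClient, UpdateOne

    coll = MongoClient()["DBName"]["measurements"]  # assumed names

    docs = [{"_id": f"sensor-42:{i}", "value": float(i)} for i in range(1000)]

    # Unordered bulk upsert, the modern equivalent of
    # initialize_unordered_bulk_op() + find().upsert().update_one().
    ops = [UpdateOne({"_id": d["_id"]}, {"$set": d}, upsert=True) for d in docs]
    result = coll.bulk_write(ops, ordered=False)
    print(result.upserted_count, result.modified_count)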

Relative Performance
[Chart: Comparison of API Methods, docs/sec for Insert Array, Insert Unordered Bulk, Update Unordered Bulk, Insert Single, and Update Single]

Benchmarks and other lies.
Answering: "Why can't I just use a gigantic HDD RAID array?"

Benchmark Environment
- VMs
  - 4-core Intel(R) Xeon(R) CPU E3-1246 v3 @ 3.50GHz
  - 8 GB RAM
  - Sandisk Ultra II 960GB SSD
  - WD 5TB 7200rpm HDD
- MongoDB 3.4.13
  - WiredTiger, 4GB cache
  - Snappy collection compression
  - Standalone server (no replica set, no mongos)
- Data
  - 178 bytes per document in 6 fields
  - 3 indexes (2 compound)
  - Disk usage: 40% storage, 60% indexes
- Using the update unordered bulk method, 1000 docs per bulk.execute()
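A rough skeleton of the kind of driver loop behind these numbers, not the talk's actual benchmark code; the real test used 178-byte, 6-field documents and 3 indexes, whereas the documents and names here are placeholders.

    import time
    from pymongo import MongoClient, UpdateOne

    coll = MongoClient()["bench"]["TimeSeries"]  # assumed names
    BATCH = 1000  # documents per bulk_write, matching the benchmark setup

    def run(batches):
        started = time.time()
        for b in range(batches):
            ops = [
                UpdateOne({"_id": f"{b}:{i}"},
                          {"$set": {"source": i % 50, "value": float(i)}},
                          upsert=True)
                for i in range(BATCH)
            ]
            coll.bulk_write(ops, ordered=False)
        return batches * BATCH / (time.time() - started)

    print(f"{run(100):.0f} docs/sec")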

Benchmark SSD vs. HDD
[Chart: inserts/sec, SSD vs. HDD]

SSD Benchmark: 60 Minutes [chart]

SSD Benchmark: 0:30-1:00 [chart]

HDD Benchmark: 0:30-1:30 [chart]

HDD Benchmark: 0:30-8:45 (42M documents) [chart]

HDD Benchmark: Last Hour [chart]

SSD Benchmark: 0:30-2:10 (42M documents) [chart]

Benchmark SSD vs. HDD, Last Hour
[Chart: inserts/sec, SSD vs. HDD]

96 Hour Test [chart]

TL;DR
- Don't trust someone else's benchmarks (especially mine!)
- Benchmark using your own schema and indexes
- Artificially accelerate index size exceeding available memory

Time Series Data in MongoDB on a BUDGET

Replica Set Rollout Options
- Follow standard advice
  - 3-server replica sets (Primary, Secondary, Secondary)
  - Every replica set server on its own hardware
  - Disk mirroring
- Cost-cutting options
  - Primary, Secondary, Arbiter
  - Locate multiple replica set servers on the same hardware (but NOT from the SAME replica set)
  - No disk mirroring (how many copies do you really need?)
- "I love downtime and don't care about my data"
  - Single-instance servers instead of replica sets
  - RAID0 ("no wasted disk space!")
  - No backups

Storing Lots of Data
Sharding is a method for distributing data across multiple machines. MongoDB uses sharding to support deployments with very large data sets and high-throughput operations.

Conventional Sharding
- Non-sharded data kept in the default replica set
- Shard key hashed on timestamp to evenly distribute data (see the sketch below)
- Pros:
  - Increases insert rate
  - Arbitrarily large data storage
- Cons:
  - All shard replica sets should have comparable hardware
  - All shards start thrashing at the same time
  - Expanding means a LOT of rebalancing
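A hedged sketch of setting up this conventional, hashed-timestamp sharding from an application driver; it must be run against a mongos, and the host name is an assumption (the database and collection names reuse DBName.TimeSeries from the zone-sharding slides later on).

    from pymongo import MongoClient

    mongos = MongoClient("mongodb://mongos-host:27017")  # assumed mongos address

    mongos.admin.command("enableSharding", "DBName")
    # A hashed shard key on the timestamp spreads inserts evenly across shards.
    mongos.admin.command("shardCollection", "DBName.TimeSeries",
                         key={"time": "hashed"})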

Data Access Patterns
- New writes are always very recent
- Reads are almost always of recent data
- Reads of old data are intuitively slower; let's take advantage of that.

Sharding by Zone
- Non-sharded data kept in the default replica set
- Most recent time-series data stored in a fast replica set
- Older time-series data stored in slow replica sets
- Pros:
  - Pay for speed where we need it
  - Swap fast to slow before thrashing kills performance
  - Infinite data size
- Cons:
  - Ceiling on insert speed

Prerequisites for Zone Sharding
- Sharded cluster configured (config replica set, mongos, etc.)
- Existing replica set rsmain (primary shard) contains your normal (non-time-series) data
- TimeSeries collection with an index on time
- New replica set for time-series data (e.g., rs001) added as a shard

Initial Zone Ranges
Run on mongos:
    use admin
    sh.enableSharding("DBName")
    sh.shardCollection("DBName.TimeSeries", { time: 1 })
    sh.addShardTag('rsmain', 'future')
    sh.addShardTag('rs001', 'ts001')
    sh.addTagRange('DBName.TimeSeries', {time: new Date("2099-01-01")}, {time: MaxKey}, 'future')
    sh.addTagRange('DBName.TimeSeries', {time: MinKey}, {time: new Date("2099-01-01")}, 'ts001')
    // sh.splitAt('DBName.TimeSeries', {"time": new Date("2099-01-01")})

Adding a New Time-Series Replica Set, Step 1: Create New Replica Set
- When?
  - Well before you run out of available fast storage
  - Before your input capacity is lowered too close to your needs
- Where?
  - On the same server with fast storage as the current time-series replica set
Run on mongos:
    use admin
    db.runCommand({addShard: "rs002/hostname:port", name: "rs002"})
    sh.addShardTag('rs002', 'ts002')
    var configdb = db.getSiblingDB("config");
    configdb.tags.update({tag: "ts001"}, {$set: {'max.time': new ISODate("2018-04-26")}})
    sh.addTagRange('DBName.TimeSeries', {time: new Date("2018-04-26")}, {time: new Date("2099-01-01")}, 'ts002')
    // sh.splitAt('DBName.TimeSeries', {"time": new ISODate("2018-04-26")})

Adding a New Time-Series Replica Set, Step 2: Wait before Relocation
- Initially nothing changes: all data is added into the previous replica set
- Eventually, new entries match the min.time of the new replica set and will be stored there
- How long to wait before relocation?
  - Make sure you don't fill up your fast storage
  - How far back in time do normal queries go? (Queries to the previous replica set will get slower after relocation)

Adding a New Time-Series Replica Set, Step 3: Relocate to Slow Storage
- Follow the standard procedure for moving a replica set
- Multiple server instances can share the same server/storage
  - Use unique ports
  - Set wiredTigerCacheSizeGB appropriately

Pause for Questions

Wrap Up
1. Determine your anticipated time-series data rate
2. Mock up a benchmark app matching your use case
   - Focus on indexed fields and their cardinality
3. Benchmark on a single server
   - Fast storage
   - Limited memory, to accelerate index thrashing
   - Ensure benchmarks run long enough
4. Iterate, adjusting the following tradeoffs:
   - single vs. bulk/array
   - upsert vs. insert
   - size of bulk/array insert/upsert
   - if using measurement buckets, adjust the size of the bucket
5. If you achieve your needed data rate, use shard tags to push old data to slower (cheaper) servers

Rate My Session

Thank You, Sponsors!!

Thank You!