Reduce MongoDB Data Size. Steven Wang


Reduce MongoDB Data Size. Tangome Inc. Steven Wang, stwang@tango.me

Outline
- MongoDB Cluster Architecture
- Advantages of Reducing Data Size
- Several Cases to Reduce MongoDB Data Size
  - Case 1: Migrate to WiredTiger (high compression ratio)
  - Case 2: Set a TTL index to expire data in collections (having a timestamp field)
  - Case 3: Purge data based on the hidden timestamp in the _id field (no timestamp field)
  - Case 4: Use the replica set to purge data and rebuild Mongo (for large quantities of data)
- Reclaim Disk Space
- Summary

MongoDB Cluster Architecture
Each shard has one primary and two secondaries.

Advantages of Reducing Data Size
- More data can be stored in memory (speeds up queries)
- Smaller index size (speeds up queries)
- Less hard drive (SSD) usage (reduces cost)

Case 1: Migrate to WiredTiger
- WiredTiger uses document-level concurrency control for write operations (better write performance than MMAPv1).
- WiredTiger supports compression for all collections and indexes (compression minimizes storage use at the expense of additional CPU).

WiredTiger vs. MMAPv1

Things to Consider Before Migration
- Always test the migration procedure in a test environment before the production migration
- Check replica member priority: rs.conf()
- Check the MongoDB read preference mode (from the application configuration):
  - primary
  - primaryPreferred (preferred during the migration)
  - secondary
  - secondaryPreferred (for read-load performance)
  - nearest
- Check the chunk balancer status (set it to off)
- Check DB size and free disk space
- Check memory size (use it to configure the wiredTigerCacheSizeGB setting)
- Check CPU usage (WiredTiger will use more CPU for compression/decompression)
- Use monitoring tools: MMS & New Relic (monitor MongoDB response time and error rate)
- Tail mongod logs during the migration
- Collaborate: managers / network engineers / system engineers / software engineers
- Monitor the whole cluster after the migration

WiredTiger Migration Procedure
1) Change the Mongo configuration files on the Puppet master
2) Stop the Mongo balancer from one mongos server
3) Upgrade all mongos servers using the Puppet agent
4) Upgrade all Mongo config servers using the Puppet agent
5) Upgrade all mongod secondaries using the Puppet agent
6) Upgrade all mongod primaries using the Puppet agent
7) Re-enable the balancer
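The configuration change pushed out in step 1 is what switches the storage engine. A minimal mongod.conf sketch for MongoDB 3.x is below; the dbPath and cache size are illustrative assumptions, not values from the talk:

```yaml
# Hypothetical mongod.conf fragment for the WiredTiger migration.
# dbPath and cacheSizeGB are illustrative; size the cache from the
# memory check in the pre-migration checklist.
storage:
  dbPath: /var/lib/mongo
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
    collectionConfig:
      blockCompressor: snappy
```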

Compression Ratio After Migration (WiredTiger with the snappy compressor)

Cluster Name | Version Before | Version After    | Size Before (GB) | Size After (GB) | Compression Ratio
Cluster_1    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 1350             | 119             | 11.34
Cluster_2    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 1680             | 270             | 6.22
Cluster_3    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 309              | 22.6            | 13.67
Cluster_4    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 132              | 13.2            | 10.00
Cluster_5    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 234              | 32              | 7.31
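The ratio column is simply size before divided by size after, rounded to two decimals; a quick sanity check in plain JavaScript (not part of the original deck):

```javascript
// Recompute the compression ratios from the table above.
const clusters = [
  ["Cluster_1", 1350, 119],
  ["Cluster_2", 1680, 270],
  ["Cluster_3", 309, 22.6],
  ["Cluster_4", 132, 13.2],
  ["Cluster_5", 234, 32],
];
for (const [name, before, after] of clusters) {
  const ratio = Math.round((before / after) * 100) / 100;
  console.log(`${name}: ${ratio}`);
}
```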

Case 2: Set a TTL Index on a Timestamp Field

Scenario: What if your company wants to keep only 90 days of data? If your collection has a field that holds values of BSON date type, or an array of BSON date-typed objects, then you can set a TTL index to expire the data.

Examples:

1. Expire documents after a specified number of seconds.
Create a TTL index (expire in 90 days):
db.log_events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 7776000 } )
db.log_events.insert( { "createdAt": new Date(), "logEvent": 2, "logMessage": "Success!" } )

2. Expire documents at a specific clock time.
Create a TTL index:
db.log_events.createIndex( { "expireAt": 1 }, { expireAfterSeconds: 0 } )
Insert data to be expired on July 22, 2017 14:00:00:
db.log_events.insert( { "expireAt": new Date('July 22, 2017 14:00:00'), "logEvent": 2, "logMessage": "Success!" } )
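The expireAfterSeconds value of 7776000 used above is just 90 days expressed in seconds:

```javascript
// 90 days -> seconds, matching the expireAfterSeconds value in the index above
const ninetyDaysInSeconds = 90 * 24 * 60 * 60;
console.log(ninetyDaysInSeconds); // 7776000
```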

TTL Index Notes
- The background task that removes expired documents runs every 60 seconds.
- On replica set members, the TTL background thread only deletes documents when the member is in state primary. The TTL background thread is idle when a member is in state secondary. Secondary members replicate deletion operations from the primary.
- If the collection is large, it will take a long time to create the index. It is better to purge data first and then create the index on the smaller collection, or to create the TTL index when the collection is created.
- Restrictions:
  - TTL indexes are single-field indexes. Compound indexes do not support TTL and ignore the expireAfterSeconds option.
  - The _id field does not support TTL indexes.
  - You cannot use createIndex() to change the value of expireAfterSeconds of an existing index. Instead, use the collMod database command in conjunction with the index collection flag. Otherwise, to change the value of the option on an existing index, you must drop the index first and recreate it.
  - If a non-TTL single-field index already exists for a field, you cannot create a TTL index on the same field, since you cannot create indexes that have the same key specification and differ only by their options. To change a non-TTL single-field index into a TTL index, you must drop the index first and recreate it with the expireAfterSeconds option.

Case 3: Purge Data Using the _id Field

Scenario: What if your company wants to keep only 90 days of data, but your collection doesn't have a timestamp field? You can still purge data using the hidden timestamp in the _id field.

An ObjectId in the _id field consists of:
- a 4-byte value representing the seconds since the Unix epoch
- a 3-byte machine identifier
- a 2-byte process id
- a 3-byte counter, starting with a random value
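Because the leading 4 bytes encode the creation time, an ObjectId boundary for any cutoff date can be built by hex-encoding the epoch seconds and zero-filling the remaining 16 hex digits. A small sketch (the helper name is my own, not from the talk):

```javascript
// Build the smallest 24-char ObjectId hex string for a given cutoff date;
// documents whose _id is below this value were created before the cutoff.
function objectIdFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  // 4 timestamp bytes (8 hex chars) + 8 zero bytes (16 hex chars)
  return seconds.toString(16).padStart(8, "0") + "0000000000000000";
}

// Cutoff used in the sample purge script that follows
console.log(objectIdFromDate(new Date("2017-01-25T08:00:00Z")));
// -> 58885b000000000000000000
```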

Sample Purge Script

[root@server001 ~]$ cat purge_sample_log.js
// purge data older than 2017-01-25T08:00:00.000Z
use SampleLog
var removeIdsArray = db.sample_log.find(
    { _id: { $lt: ObjectId("58885b000000000000000000") } },
    { _id: 1 }
).limit(3000).toArray().map(function(doc) { return doc._id; });
db.sample_log.remove({ _id: { $in: removeIdsArray } });

[root@server001 ~]$ cat loop_purge.sh
#!/bin/bash
i=1
while [ $i -le 50000 ]; do
    /usr/bin/mongo localhost:27517 < /root/purge_sample_log.js
    i=$(( $i + 1 ))
    echo $i
    # sleep 1
done

Case 4: Purge Large-Volume Collections

Question: If a Mongo cluster stores 5 years of data and you want to keep only the latest 3 months, it may take a very long time (several years) to purge it using the batch method. How can you purge it as quickly as possible?

A solution (taking SampleLog as an example, on a replica set with primary A and secondaries B and C):

Step 1: Take the 1st secondary (B) out of its replica set as a standalone server
Step 2: Disable the replication properties and start mongod on B
Step 3: Create a new collection, SampleLog_New
Step 4: Select the latest 3 months of data from SampleLog and insert it into SampleLog_New
Step 5: Rename SampleLog to SampleLog_Old
Step 6: Rename SampleLog_New to SampleLog
Step 7: Stop mongod on B and re-enable replication
Step 8: Add B back into the replica set. After the system is stable, remove SampleLog_Old from B
Step 9: On the 2nd secondary (C), repeat steps 1 to 8
Step 10: On primary A, fail over the primary role to one of the secondaries. Shut down mongod, wipe out the data in the data directory, and start mongod. A will replicate the data from secondary B or C
Step 11: Run steps 1 to 10 on the other shards

Reclaim Deleted Space After Purging

Note: MongoDB won't release unused disk space on its own. If there are lots of deletes, you need to periodically compact the data to reclaim disk space. You can run db.stats() to check it.

Three ways to reclaim disk space:

- Compact collections
  - Works at the collection level: run it on each collection
  - Blocks all other operations at the database level, so you have to plan for some downtime
  - Command: db.runCommand({ compact: 'collectionName' })
- Repair databases
  - Works at the database level
  - Checks and repairs errors and inconsistencies
  - Blocks all other operations on your database, so you need to schedule downtime
  - Needs free space equivalent to the data in your database plus an additional 2 GB or more
  - Command: mongod --repair --repairpath /mnt/vol1, or db.repairDatabase(), or db.runCommand({ repairDatabase: 1 })
- Re-sync instances
  - Works at the instance level
  - In a replica set, unused disk space can be released by running an initial sync
  - Steps:
    - On a secondary: stop the mongod instance; delete the data in the data directory; start the mongod instance; wait for replication to rebuild all data
    - On the primary: fail over; stop the mongod instance; delete the data in the data directory; start the mongod instance; wait for replication to rebuild all data

Summary
- Use the WiredTiger storage engine if possible
- Purge data by TTL index if there are timestamp fields
- Otherwise, purge data by the _id field
- Use the replica set method for purging large volumes of data
- Always remember to reclaim deleted space after purging

Thank You!