Reduce MongoDB Data Size
Steven Wang, Tangome Inc.
stwang@tango.me
Outline
- MongoDB Cluster Architecture
- Advantages of Reducing Data Size
- Several Cases for Reducing MongoDB Data Size
  - Case 1: Migrate to WiredTiger (high compression ratio)
  - Case 2: Set a TTL index to expire data in collections (having a timestamp field)
  - Case 3: Purge data based on the hidden timestamp in the _id field (no timestamp field)
  - Case 4: Use a replica set to purge data and rebuild Mongo (for large quantities of data)
- Reclaim Disk Space
- Summary
MongoDB Cluster Architecture
Each shard has one primary and two secondaries.
Advantages of Reducing Data Size
- More data can be stored in memory (speeds up queries)
- Smaller index size (speeds up queries)
- Less hard drive (SSD) usage (reduces cost)
Case 1: Migrate to WiredTiger
- WiredTiger uses document-level concurrency control for write operations (better write performance than MMAPv1).
- WiredTiger supports compression for all collections and indexes (compression minimizes storage use at the expense of additional CPU).
WiredTiger vs. MMAPv1
(comparison slide)
Things to Consider Before Migration
- Always test the migration procedure in a test environment before migrating production.
- Check replica member priority: rs.conf()
- Check the MongoDB read preference mode (from the application configuration):
  - primary
  - primaryPreferred (preferred during the migration)
  - secondary
  - secondaryPreferred (for read load performance)
  - nearest
- Check the chunk balancer status (set it to off).
- Check DB size and free disk space.
- Check memory size (use it to configure the wiredTigerCacheSizeGB setting).
- Check CPU usage (WiredTiger will use more CPU for compression/decompression).
- Use monitoring tools: MMS and New Relic (monitor MongoDB response time and error rate).
- Tail the mongod logs during the migration.
- Collaborate: managers, network engineers, system engineers, software engineers.
- Monitor the whole cluster after the migration.
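The memory check above feeds into sizing the WiredTiger cache. A minimal sketch of the sizing rule, assuming the MongoDB 3.0-era default (the larger of 1 GB or half of physical RAM); the function name is illustrative, and you should leave headroom for the filesystem cache and connection memory:

```javascript
// Sketch of the MongoDB 3.0-era default for wiredTigerCacheSizeGB:
// the larger of 1 GB or half of physical RAM.
// defaultWiredTigerCacheGB is a hypothetical helper name, not a MongoDB API.
function defaultWiredTigerCacheGB(ramGB) {
  return Math.max(1, ramGB / 2);
}

console.log(defaultWiredTigerCacheGB(64)); // 32
console.log(defaultWiredTigerCacheGB(1));  // 1
```

In practice you may want to set the cache lower than this default so the OS page cache and per-connection memory still have room.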
WiredTiger Migration Procedure
1) Change the Mongo configuration files on the Puppet master.
2) Stop the Mongo balancer from one mongos server.
3) Upgrade all mongos servers using the Puppet agent.
4) Upgrade all Mongo config servers using the Puppet agent.
5) Upgrade all mongod secondaries using the Puppet agent.
6) Upgrade all mongod primaries using the Puppet agent.
7) Re-enable the balancer.
Compression Ratio After Migration (WiredTiger with the snappy compressor)

Cluster Name | Version Before | Version After    | Size Before (GB) | Size After (GB) | Compression Ratio
Cluster_1    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 1350             | 119             | 11.34
Cluster_2    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 1680             | 270             | 6.22
Cluster_3    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 309              | 22.6            | 13.67
Cluster_4    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 132              | 13.2            | 10.00
Cluster_5    | 2.8 MMAPv1     | 3.0.3 WiredTiger | 234              | 32              | 7.31
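The "Compression Ratio" column is simply size before divided by size after. A small sketch that recomputes it from the table values:

```javascript
// Compression ratio = size before migration / size after migration,
// rounded to two decimals to match the table.
function compressionRatio(beforeGB, afterGB) {
  return Math.round((beforeGB / afterGB) * 100) / 100;
}

console.log(compressionRatio(1350, 119)); // 11.34 (Cluster_1)
console.log(compressionRatio(234, 32));   // 7.31  (Cluster_5)
```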
Case 2: Set a TTL Index on a Timestamp Field
Scenario: What if your company wants to keep only 90 days of data?
If your collection has a field that holds values of BSON date type (or an array of BSON-date objects), you can set a TTL index to expire the data.

Examples:
1. Expire documents after a specified number of seconds
- Create a TTL index (expire in 90 days):
db.log_events.createIndex( { "createdAt": 1 }, { expireAfterSeconds: 7776000 } )
db.log_events.insert( { "createdAt": new Date(), "logEvent": 2, "logMessage": "Success!" } )

2. Expire documents at a specific clock time
- Create a TTL index:
db.log_events.createIndex( { "expireAt": 1 }, { expireAfterSeconds: 0 } )
- Insert data to be expired on July 22, 2017 14:00:00:
db.log_events.insert( { "expireAt": new Date('July 22, 2017 14:00:00'), "logEvent": 2, "logMessage": "Success!" } )
TTL Index Notes
- The background task that removes expired documents runs every 60 seconds.
- On replica set members, the TTL background thread deletes documents only when the member is in state primary; it is idle when the member is in state secondary. Secondary members replicate the delete operations from the primary.
- If the collection is large, creating the index will take a long time. It is better to purge the data first and then create the index on the smaller collection, or to create the TTL index when the collection is created.
- Restrictions:
  - TTL indexes are single-field indexes. Compound indexes do not support TTL and ignore the expireAfterSeconds option.
  - The _id field does not support TTL indexes.
  - You cannot use createIndex() to change the value of expireAfterSeconds on an existing index. Instead, use the collMod database command in conjunction with the index collection flag. Otherwise, to change the option on an existing index, you must drop the index and recreate it.
  - If a non-TTL single-field index already exists on a field, you cannot create a TTL index on the same field, since you cannot create indexes that have the same key specification and differ only in options. To change a non-TTL single-field index into a TTL index, you must drop the index and recreate it with the expireAfterSeconds option.
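The collMod route mentioned above can be sketched as follows (mongo shell fragment; the collection name and new TTL value are illustrative, and the keyPattern must match the existing TTL index):

```javascript
// Change expireAfterSeconds on an existing TTL index without rebuilding it.
// "log_events" and the 1-hour TTL are example values.
db.runCommand( {
  collMod: "log_events",
  index: {
    keyPattern: { createdAt: 1 },   // must match the existing TTL index key
    expireAfterSeconds: 3600        // new TTL: 1 hour
  }
} )
```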
Case 3: Purge Data Using the _id Field
Scenario: What if your company wants to keep only 90 days of data, but your collection doesn't have a timestamp field? You can still purge data using the hidden timestamp in the _id field.
An ObjectId in _id consists of:
- a 4-byte value representing the seconds since the Unix epoch
- a 3-byte machine identifier
- a 2-byte process id
- a 3-byte counter, starting with a random value
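Because the leading 4 bytes are a Unix timestamp, any cutoff date can be turned into a synthetic ObjectId hex string with the remaining 8 bytes zeroed; every document created before the cutoff then has an _id less than it. A plain Node.js sketch (the helper name is illustrative; in the mongo shell the same idea is exposed via ObjectId().getTimestamp()):

```javascript
// Build a 24-hex-char ObjectId string whose first 4 bytes encode the
// given date and whose remaining 8 bytes are zero. Documents with
// _id < this value were created before the date.
function objectIdFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, "0") + "0000000000000000";
}

// Example: the cutoff used in the purge script on the next slide.
const cutoff = objectIdFromDate(new Date("2017-01-25T08:00:00Z"));
console.log(cutoff); // "58885b000000000000000000"
```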
Sample Scripts for the Purge

[root@server001 ~]$ cat purge_sample_log.js
// purge data older than 2017-01-25T08:00:00.000Z
var db = db.getSiblingDB("SampleLog");
var removeIdsArray = db.sample_log.find(
    { _id: { $lt: ObjectId("58885b000000000000000000") } },
    { _id: 1 }
).limit(3000).toArray().map(function(doc) { return doc._id; });
db.sample_log.remove( { _id: { $in: removeIdsArray } } );

[root@server001 ~]$ cat loop_purge.sh
#!/bin/bash
i=1
while [ $i -le 50000 ]; do
    /usr/bin/mongo localhost:27517 < /root/purge_sample_log.js
    i=$(( i + 1 ))
    echo $i
    # sleep 1
done
Case 4: Purge Large-Volume Collections
Question: If a Mongo cluster stores 5 years of data and you want to keep only the latest 3 months, it may take a very long time (several years) to purge it using the batch method. How can you purge it as quickly as possible?

Solution (using SampleLog as an example; the replica set has primary A and secondaries B and C):
Step 1: Take the first secondary (B) out of its replica set as a standalone server.
Step 2: Disable the replication properties and start mongod on B.
Step 3: Create a new collection, SampleLog_New.
Step 4: Select the latest 3 months of data from SampleLog and insert it into SampleLog_New.
Step 5: Rename SampleLog to SampleLog_Old.
Step 6: Rename SampleLog_New to SampleLog.
Step 7: Stop mongod on B and re-enable replication.
Step 8: Add B back into the replica set. After the system is stable, remove SampleLog_Old from B.
Step 9: Repeat steps 1 to 8 on the second secondary (C).
Step 10: On primary A, fail over the primary role to one of the secondaries. Shut down mongod, wipe the data directory, and start mongod. A will then replicate the data from secondary B or C.
Step 11: Run steps 1 to 10 on the other shards.
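Steps 3 to 6 above can be sketched in the mongo shell as follows, run on the standalone member (a sketch under the example's naming; the 90-day cutoff construction reuses the _id-timestamp trick from Case 3, and a real run would batch the copy rather than loop document by document):

```javascript
// Run on the standalone member (B). DB and collection names follow the
// SampleLog example; adjust to your own schema.
var db = db.getSiblingDB("SampleLog");

// Synthetic ObjectId cutoff: 90 days ago, remaining bytes zeroed.
var cutoffSecs = Math.floor(Date.now() / 1000) - 90 * 24 * 3600;
var cutoff = ObjectId(cutoffSecs.toString(16) + "0000000000000000");

// Step 4: copy only the recent documents into the new collection.
db.SampleLog.find({ _id: { $gte: cutoff } }).forEach(function(doc) {
  db.SampleLog_New.insert(doc);
});

// Steps 5-6: swap the collections via rename.
db.SampleLog.renameCollection("SampleLog_Old");
db.SampleLog_New.renameCollection("SampleLog");
```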
Reclaim Deleted Space After Purging
Note: MongoDB won't release unused disk space on its own. If there are lots of deletes, you need to periodically compact the data to reclaim disk space. You can run db.stats() to check it.

Three ways to reclaim disk space:
- Compact collections
  - Works at the collection level: run it on each collection.
  - Places a block on all other operations at the database level, so you have to plan for some downtime.
  - Command: db.runCommand( { compact: 'collectionName' } )
- Repair databases
  - Works at the database level.
  - Checks and repairs errors and inconsistencies.
  - Blocks all other operations on your database, so schedule a downtime.
  - Needs free space equivalent to the data in your database plus an additional 2 GB or more.
  - Command: mongod --repair --repairpath /mnt/vol1, or db.repairDatabase(), or db.runCommand( { repairDatabase: 1 } )
- Re-sync instances
  - Works at the instance level.
  - In a replica set, unused disk space can be released by running an initial sync.
  - Steps:
    - On each secondary: stop the mongod instance; delete the data in the data directory; start the mongod instance; wait for replication to rebuild all the data.
    - On the primary: fail over; then stop the mongod instance, delete the data in the data directory, start the mongod instance, and wait for replication to rebuild all the data.
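Since compact works per collection, it is convenient to loop over every collection in a database. A mongo shell sketch (the database name is the example's; compact blocks operations, so run this in a maintenance window):

```javascript
// Run compact on every collection in the example database.
// Blocks other operations — schedule downtime before running.
var db = db.getSiblingDB("SampleLog");
db.getCollectionNames().forEach(function(name) {
  printjson(db.runCommand({ compact: name }));
});
```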
Summary
- Use the WiredTiger storage engine if possible.
- Purge data by TTL index if there are timestamp fields.
- Purge data by the _id field otherwise.
- Use a replica set for purging large volumes of data.
- Always remember to reclaim deleted space after purging.
Thank You!