High Performance NoSQL with MongoDB

History of NoSQL
June 11th, 2009, San Francisco, USA
Johan Oskarsson (from http://last.fm/) organized a meetup to discuss advances in data storage, all of which used distributed databases running on clusters. He asked the group for a short term they could use as a hashtag. [1] Eric Evans (not of DDD fame) proposed #NoSQL and it stuck.

Michael's NoSQL Definition
Database systems which are cluster-friendly and which trade inter-entity relationships for both simplicity and performance.

Four types of "NoSQL" DBs
Key-value stores: Amazon DynamoDB, Redis
Column-oriented databases: HBase, Cassandra, Google BigQuery
Graph databases: Neo4j, OrientDB
Document databases: MongoDB, CouchDB, DocumentDB (on Azure)

Key-value data storage

Column Oriented DBs

Graph DBs

Document DBs

Not so different

How much perf do you need?
Image credit: nerovivo

Relational 3NF models are complex

Document DBs for simplicity
Document DB style
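To make the contrast with the 3NF slide concrete, here is a minimal sketch in plain JavaScript of the same order stored both ways. The order data, the field names, and the `assembleOrder` helper are all hypothetical and exist only for illustration; no database is involved.

```javascript
// Hypothetical data: the same order stored two ways.

// Relational style (3NF): three "tables" linked by foreign keys.
const customers = [{ id: 1, name: "Ada" }];
const orders = [{ id: 10, customerId: 1 }];
const orderLines = [
  { orderId: 10, sku: "A-1", qty: 2 },
  { orderId: 10, sku: "B-7", qty: 1 },
];

// Reading one order means joining the tables back together.
function assembleOrder(orderId) {
  const order = orders.find((o) => o.id === orderId);
  const customer = customers.find((c) => c.id === order.customerId);
  const lines = orderLines
    .filter((l) => l.orderId === orderId)
    .map(({ sku, qty }) => ({ sku, qty }));
  return { _id: order.id, customer: { name: customer.name }, lines };
}

// Document style: the order is stored as one nested document,
// so a single lookup by _id returns everything with no joins.
const orderDocument = {
  _id: 10,
  customer: { name: "Ada" },
  lines: [
    { sku: "A-1", qty: 2 },
    { sku: "B-7", qty: 1 },
  ],
};
```

The trade-off matches the definition earlier in the deck: the document form gives up the separate, independently queryable relationships in exchange for a simpler read path.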

Single server performance
Single biggest performance problem (and fix)? Incorrect indexes (too few or too many)

Adding indexes
Be data-driven: profile first, then add indexes

Adding indexes
Indexes matter even more in MongoDB than they do in an RDBMS

Demo time

Step 1: Enable profiling

Step 2: Run common queries

Step 3: Analyze system.profile

Step 4: Add indexes for slow queries

Step 5: GOTO 1
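The loop above can be sketched in code. The mongosh commands in the comments are standard shell helpers, but the 20 ms threshold, the database and collection names, and the sample entries are assumptions made for illustration. The function itself runs in plain JavaScript against a hard-coded stand-in for `system.profile`, so it illustrates the analysis step rather than reproducing the talk's demo.

```javascript
// Sketch: pick index candidates out of profiler output.
// In mongosh, the steps would look roughly like this (values are assumptions):
//   db.setProfilingLevel(1, { slowms: 20 })           // Step 1: log slow operations
//   db.system.profile.find().sort({ millis: -1 })     // Step 3: slowest first
//   db.data.createIndex({ st: 1, ts: 1 })             // Step 4: index the slow query
// Below, a hypothetical sample stands in for system.profile documents.
const sampleProfile = [
  { op: "query", ns: "weather.data", millis: 840, planSummary: "COLLSCAN", docsExamined: 2500000, nreturned: 1 },
  { op: "query", ns: "weather.data", millis: 3, planSummary: "IXSCAN { st: 1, ts: 1 }", docsExamined: 1, nreturned: 1 },
  { op: "query", ns: "weather.stations", millis: 120, planSummary: "COLLSCAN", docsExamined: 90000, nreturned: 12 },
];

// Keep slow operations that scanned the whole collection, slowest first.
// A high examined-to-returned ratio suggests an index would help.
function findIndexCandidates(entries, slowMs) {
  return entries
    .filter((e) => e.millis >= slowMs && e.planSummary === "COLLSCAN")
    .sort((a, b) => b.millis - a.millis)
    .map((e) => ({
      ns: e.ns,
      millis: e.millis,
      examinedPerReturned: e.docsExamined / Math.max(e.nreturned, 1),
    }));
}

const candidates = findIndexCandidates(sampleProfile, 20);
```

After adding an index for the top candidate, rerun the common queries and check the profile again, which is the "GOTO 1" step.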

Scaling out
Image credits: johnantoni, Torkild Retvedt

Scaling out
Scale-out is the great promise of NoSQL
MongoDB has two modes of scale-out:
    Sharding
    Replication
Real-world statistics from one company:
    120,000 DB operations / second
    2 GB of app-to-DB I/O / second

Replication vs. scalability
Sharding is the primary way to improve single-query speed.
Replication is not the primary way to scale: you may get better read performance, but write performance does not improve much, so it only pays off for very read-heavy workloads.

            Replication    Sharding
Server 1    A-B-C-D-E      A
Server 2    A-B-C-D-E      B
Server 3    A-B-C-D-E      C
Server 4    A-B-C-D-E      D
Server 5    A-B-C-D-E      E
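The difference between the two layouts can be sketched in a few lines of plain JavaScript. This is a toy model: the round-robin placement in `shard` is a simplification chosen for illustration and is not how MongoDB assigns data to shards, and all function names are hypothetical.

```javascript
const keys = ["A", "B", "C", "D", "E"];
const serverCount = 5;

// Replication: every server stores a full copy of every key.
function replicate(allKeys, count) {
  return Array.from({ length: count }, () => [...allKeys]);
}

// Sharding: each key lives on exactly one server.
// Round-robin placement is a simplification for this sketch.
function shard(allKeys, count) {
  const servers = Array.from({ length: count }, () => []);
  allKeys.forEach((key, i) => servers[i % count].push(key));
  return servers;
}

// Number of servers that must apply a write to the given key.
function serversTouchedByWrite(layout, key) {
  return layout.filter((server) => server.includes(key)).length;
}

const replicated = replicate(keys, serverCount);
const sharded = shard(keys, serverCount);

const replicatedWriteCost = serversTouchedByWrite(replicated, "C"); // every copy must be updated
const shardedWriteCost = serversTouchedByWrite(sharded, "C");       // only the owning shard
```

In this model a write to a replicated key lands on all five servers, while a write to a sharded key lands on one, which is why replication helps reads far more than writes.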

Sharding...

Scaling via sharding: an example
Weather data from the entire 20th century in MongoDB
Case study by MongoDB Inc: http://www.mongodb.com/presentations/weather-century-part-2-high-performance

Data size and quantity
2.5 billion data points
4 terabytes (about 1.6 KB per document)

Sample record (JSON)
{
    "st" : "u725053",
    "ts" : ISODate("2013-06-03T22:51:00Z"),
    "airtemperature" : { "value" : 21.1, "quality" : "5" },
    "atmosphericpressure" : { "value" : 1009.7, "quality" : "5" }
}

Sample record in C#
using System;

class WeatherRecord
{
    public string st { get; set; }
    public DateTime ts { get; set; }
    public Temp airtemperature { get; set; }
    public Pressure atmosphericpressure { get; set; }
}

class Temp
{
    // double, not int: the sample record stores 21.1
    public double value { get; set; }
    public string quality { get; set; }
}

class Pressure
{
    // double, not int: the sample record stores 1009.7
    public double value { get; set; }
    public string quality { get; set; }
}

Scale up
A single server with a really big disk
Application: c3.8xlarge
mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD

Scale-out configuration
A really big cluster where everything is in RAM
Application / mongos: c3.8xlarge...
mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk
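A quick back-of-the-envelope check, using only the numbers on the slides, shows why the cluster can keep everything in RAM. Decimal units (1 TB = 10^12 bytes) and an even spread of data across nodes are assumptions of this sketch.

```javascript
// Figures from the slides.
const dataPoints = 2.5e9;   // 2.5 billion documents
const totalBytes = 4e12;    // 4 TB, assuming decimal units
const nodes = 100;          // r3.2xlarge instances
const ramPerNodeGB = 61;

// Average document size: 4 TB / 2.5 billion = 1,600 bytes (about 1.6 KB).
const bytesPerDocument = totalBytes / dataPoints;

// Total cluster memory: 100 x 61 GB = 6,100 GB (6.1 TB).
const clusterRamGB = nodes * ramPerNodeGB;

// Data per node if spread evenly (an assumption): 4,000 GB / 100 = 40 GB.
const dataPerNodeGB = totalBytes / 1e9 / nodes;

// 40 GB of data against 61 GB of RAM leaves headroom for indexes.
const fitsInRam = dataPerNodeGB < ramPerNodeGB;
```

The same arithmetic explains the scale-up alternative: 4 TB does not fit in a single server's 251 GB of RAM, so that configuration relies on a large SSD instead.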

Can scale even more
A really big cluster where everything is in RAM
Application / mongos: ...
mongod: 100 x r3.2xlarge @ 61 GB RAM, 100 GB disk

Cost per year in AWS?
$60,000 / yr
$700,000 / yr...

Performance: single time and place
db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: query latency in ms (avg, 95th, 99th percentile; axis 0 to 2 ms), single server vs. cluster]
Max. throughput: single server 40,000/s, cluster 610,000/s (10 mongos)

Performance: 1 year's weather
db.data.find({"st" : "u747940", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
[Chart: targeted query latency in ms (avg, 95th, 99th percentile; axis 0 to 5000 ms), single server vs. cluster]
Max. throughput: single server 20/s, cluster 430/s (10 mongos)
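Both queries filter on a station (`st`) and a time (`ts`), which is the access pattern a compound index ordered by station and then timestamp serves well. The sketch below illustrates the idea with a sorted array and binary search in plain JavaScript. It is a conceptual model and not MongoDB's index implementation; the sample entries are hypothetical, and the assumption that the case study indexed on `(st, ts)` is ours, not stated on the slides.

```javascript
// Hypothetical index entries, kept sorted by (st, ts) as a compound index would be.
const index = [
  { st: "u725053", ts: Date.UTC(1989, 5, 1) },
  { st: "u747940", ts: Date.UTC(1988, 11, 31) },
  { st: "u747940", ts: Date.UTC(1989, 0, 1) },
  { st: "u747940", ts: Date.UTC(1989, 6, 16) },
  { st: "u747940", ts: Date.UTC(1990, 0, 1) },
  { st: "u999999", ts: Date.UTC(1989, 3, 2) },
];

// Order entries by station first, then by timestamp.
function compare(a, b) {
  if (a.st !== b.st) return a.st < b.st ? -1 : 1;
  return a.ts - b.ts;
}

// Binary search: first position whose entry is >= target.
function lowerBound(entries, target) {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (compare(entries[mid], target) < 0) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Equivalent of { st, ts: { $gte: fromTs, $lt: toTs } }:
// two binary searches, then one contiguous slice.
function rangeScan(entries, st, fromTs, toTs) {
  const start = lowerBound(entries, { st, ts: fromTs });
  const end = lowerBound(entries, { st, ts: toTs });
  return entries.slice(start, end);
}

const year1989 = rangeScan(index, "u747940", Date.UTC(1989, 0, 1), Date.UTC(1990, 0, 1));
```

Because one station's readings sit next to each other in this ordering, a year-long range query touches a single contiguous run of entries instead of scanning the whole collection.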

Analytics
db.data.aggregate([
    { "$match" : { "airtemperature.quality" : { "$in" : [ "1", "5" ] } } },
    { "$group" : { "_id" : null, "maxtemp" : { "$max" : "$airtemperature.value" } } }
])
Result: 61.8 C = 143 F
Single server: 4 h 45 min
Cluster: 2 min (142x faster)
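For readers less familiar with the aggregation pipeline, here is a rough plain-JavaScript equivalent of the two stages, followed by a check of the figures on the slide. The three sample records are hypothetical, and the function only mirrors what the pipeline computes; it says nothing about how MongoDB executes it.

```javascript
// Hypothetical records in the same shape as the sample document.
const records = [
  { airtemperature: { value: 21.1, quality: "5" } },
  { airtemperature: { value: 61.8, quality: "1" } },
  { airtemperature: { value: 99.9, quality: "9" } }, // rejected by the quality filter
];

function maxTemp(docs) {
  // $match: keep only readings whose quality code is "1" or "5".
  const accepted = docs.filter((d) => ["1", "5"].includes(d.airtemperature.quality));
  // $group with _id null and $max: one maximum over all accepted readings.
  return accepted.reduce((max, d) => Math.max(max, d.airtemperature.value), -Infinity);
}

// Checking the slide's numbers.
const celsiusToFahrenheit = (c) => (c * 9) / 5 + 32;
const maxC = maxTemp(records);
const maxF = Math.round(celsiusToFahrenheit(maxC));

// 4 h 45 min = 285 min on one server vs. 2 min on the cluster.
const speedup = (4 * 60 + 45) / 2;
```

The speedup works out to 142.5, which the slide reports as 142x.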

Get the code and data https://github.com/mikeckennedy/sdd2016

Want to go deeper?
talkpython.fm
training.talkpython.fm
michaelckennedy.net
mikeckennedy@gmail.com
@mkennedy