High Performance NoSQL with MongoDB
History of NoSQL
June 11th, 2009, San Francisco, USA
Johan Oskarsson (from http://last.fm/) organized a meetup to discuss advances in data storage, all of which used distributed databases running on clusters. He asked the group for a short term they could use as a hashtag. [1] Eric Evans (not of DDD fame) proposed #NoSQL and it stuck.
Michael's NoSQL definition
Database systems that are cluster-friendly and that trade inter-entity relationships for both simplicity and performance.
Four types of "NoSQL" DBs
Key-value stores: Amazon DynamoDB, Redis
Column-oriented databases: HBase, Cassandra, Google BigQuery
Graph databases: Neo4j, OrientDB
Document databases: MongoDB, CouchDB, DocumentDB (on Azure)
Key-value data storage
Column Oriented DBs
Graph DBs
Document DBs
Not so different
How much performance do you need?
Image credit: nerovivo
Relational 3NF models are complex
Document DBs for simplicity Document db style
Single server performance
Single biggest performance problem (and fix)?
Incorrect indexes (too few or too many)
Adding indexes
Be data-driven: profile first, then add indexes.
Indexes are even more important in MongoDB than in RDBMSes.
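To make "too few or too many" concrete, here is a toy model in plain JavaScript. It is only a sketch and not how MongoDB stores indexes: `ToyCollection`, `findEq`, and the sample documents are hypothetical names invented for illustration. It shows that an index cuts the number of documents examined on reads, while every extra index adds work to each write.

```javascript
// Toy model, not MongoDB internals: a collection is an array of documents
// plus optional in-memory indexes (Map from field value to matching docs).
class ToyCollection {
  constructor() {
    this.docs = [];
    this.indexes = new Map(); // field name -> Map(value -> [docs])
    this.indexWrites = 0;     // index entries maintained during inserts
  }

  _addToIndex(index, field, doc) {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }

  createIndex(field) {
    const index = new Map();
    for (const doc of this.docs) this._addToIndex(index, field, doc);
    this.indexes.set(field, index);
  }

  insert(doc) {
    this.docs.push(doc);
    // Every index is updated on every write: the cost of "too many".
    for (const [field, index] of this.indexes) {
      this._addToIndex(index, field, doc);
      this.indexWrites += 1;
    }
  }

  // Returns the matches and how many documents had to be examined.
  findEq(field, value) {
    const index = this.indexes.get(field);
    if (index) {
      const hits = index.get(value) || [];
      return { docs: hits, examined: hits.length };
    }
    const hits = this.docs.filter((d) => d[field] === value);
    return { docs: hits, examined: this.docs.length }; // full scan
  }
}

const coll = new ToyCollection();
for (let i = 0; i < 1000; i++) {
  coll.insert({ st: "station" + (i % 100), reading: i });
}

const withoutIndex = coll.findEq("st", "station7"); // scans everything
coll.createIndex("st");
const withIndex = coll.findEq("st", "station7");    // touches only the matches

console.log(withoutIndex.examined, withIndex.examined); // 1000 10
```

The same 10 documents come back both times; only the amount of work differs.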
Demo time
Step 1: Enable profiling
Step 2: Run common queries
Step 3: Analyze system.profile
Step 4: Add indexes for slow queries
Step 5: GOTO 1
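The loop above can be sketched as follows. The shell commands in the comments (`db.setProfilingLevel`, `db.system.profile.find`, `createIndex`) are standard mongo shell calls, but the 100 ms threshold, the `weather.*` namespaces, the sample profile entries, and the `findIndexCandidates` helper are all assumptions made for illustration. The analysis itself runs over plain objects so it can be tried without a server.

```javascript
// In the mongo shell, steps 1, 3 and 4 might look roughly like this:
//   db.setProfilingLevel(1, 100)   // step 1: log operations slower than 100 ms
//   db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 })  // step 3
//   db.data.createIndex({ st: 1, ts: 1 })   // step 4 (hypothetical index)

// Hypothetical sample of system.profile entries (values invented).
const profileEntries = [
  { op: "query", ns: "weather.data", millis: 840, planSummary: "COLLSCAN" },
  { op: "query", ns: "weather.data", millis: 3, planSummary: "IXSCAN { st: 1, ts: 1 }" },
  { op: "query", ns: "weather.stations", millis: 420, planSummary: "COLLSCAN" },
  { op: "query", ns: "weather.data", millis: 910, planSummary: "COLLSCAN" },
];

// Step 3: keep slow operations that scanned a whole collection,
// grouped by namespace and sorted with the slowest first.
function findIndexCandidates(entries, slowMs) {
  const byNamespace = new Map();
  for (const e of entries) {
    if (e.millis <= slowMs || e.planSummary !== "COLLSCAN") continue;
    const stats = byNamespace.get(e.ns) || { ns: e.ns, count: 0, worstMillis: 0 };
    stats.count += 1;
    stats.worstMillis = Math.max(stats.worstMillis, e.millis);
    byNamespace.set(e.ns, stats);
  }
  return [...byNamespace.values()].sort((a, b) => b.worstMillis - a.worstMillis);
}

const candidates = findIndexCandidates(profileEntries, 100);
console.log(candidates);
```

Each entry in `candidates` names a collection whose slow queries ran without an index, which is where step 4 would start.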
Scaling out
Image credits: johnantoni, Torkild Retvedt
Scaling out
Scale-out is the great promise of NoSQL.
MongoDB has two modes of scale-out:
  Sharding
  Replication
Real-world statistics from one company:
  120,000 DB operations / second
  2 GB of app-to-DB I/O / second
Replication vs. scalability
Sharding is the primary way to improve single query speed.
Replication is not the primary way to scale: you may get better read performance, but not much better write performance, so it only helps when the workload is very read heavy.

Server     Replication   Sharding
Server 1   A-B-C-D-E     A
Server 2   A-B-C-D-E     B
Server 3   A-B-C-D-E     C
Server 4   A-B-C-D-E     D
Server 5   A-B-C-D-E     E
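The two layouts above can be modeled in a few lines. This is a simplified sketch: real MongoDB sharding places data by shard key ranges or hashes, whereas the hypothetical `shard` function here just deals chunks out round-robin. The point is the share of the data set each server has to store and write.

```javascript
const chunks = ["A", "B", "C", "D", "E"];
const servers = ["Server 1", "Server 2", "Server 3", "Server 4", "Server 5"];

// Replication: every server holds a full copy of the data.
function replicate(chunkList, serverList) {
  const layout = {};
  for (const s of serverList) layout[s] = [...chunkList];
  return layout;
}

// Sharding (simplified round-robin): each chunk lives on exactly one server.
function shard(chunkList, serverList) {
  const layout = {};
  for (const s of serverList) layout[s] = [];
  chunkList.forEach((c, i) => layout[serverList[i % serverList.length]].push(c));
  return layout;
}

// Fraction of the whole data set that each server must store and write.
function perServerShare(layout, totalChunks) {
  const shares = {};
  for (const [server, held] of Object.entries(layout)) {
    shares[server] = held.length / totalChunks;
  }
  return shares;
}

const replicated = replicate(chunks, servers);
const sharded = shard(chunks, servers);

console.log(perServerShare(replicated, chunks.length)["Server 1"]); // 1
console.log(perServerShare(sharded, chunks.length)["Server 1"]);    // 0.2
```

With replication every write lands on every server, so write capacity does not grow with the cluster; with sharding each server handles only its slice.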
Sharding...
Scaling via sharding: an example
Weather data from the entire 20th century in MongoDB
Case study by MongoDB Inc: http://www.mongodb.com/presentations/weather-century-part-2-high-performance
Data size and quantity
2.5 billion data points
4 terabytes (about 1.6 KB per document)
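As a quick sanity check, the two figures above agree with each other. The sketch below assumes "1.6k" means 1,600 bytes and that a terabyte is counted in decimal units (10^12 bytes); both are assumptions, not something the case study states.

```javascript
const documents = 2.5e9;         // 2.5 billion data points
const bytesPerDocument = 1.6e3;  // assumption: "1.6k" read as 1,600 bytes

const totalBytes = documents * bytesPerDocument;
const totalTerabytes = totalBytes / 1e12; // assumption: decimal terabytes

console.log(totalTerabytes); // 4
```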
Sample record (JSON)
{
  "st" : "u725053",
  "ts" : ISODate("2013-06-03T22:51:00Z"),
  "airtemperature" : { "value" : 21.1, "quality" : "5" },
  "atmosphericpressure" : { "value" : 1009.7, "quality" : "5" }
}
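Sample record in C#
using System;

class WeatherRecord
{
    public string st { get; set; }                    // station id
    public DateTime ts { get; set; }                  // timestamp of the reading
    public Temp airtemperature { get; set; }
    public Pressure atmosphericpressure { get; set; }
}

class Temp
{
    // double, not int: the sample value (21.1) is fractional
    public double value { get; set; }
    public string quality { get; set; }
}

class Pressure
{
    // double, not int: the sample value (1009.7) is fractional
    public double value { get; set; }
    public string quality { get; set; }
}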
Scale up
A single server with a really big disk
Application: c3.8xlarge
mongod: i2.8xlarge, 251 GB RAM, 6 TB SSD
Scale-out configuration
A really big cluster where everything is in RAM
Application / mongos: c3.8xlarge...
mongod: 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk
Can scale even more
A really big cluster where everything is in RAM
Application / mongos...
mongod...: 100 x r3.2xlarge, each with 61 GB RAM and 100 GB disk
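A rough check of the "everything is in RAM" claim, using the 4 TB data set size from the case study. This is back-of-the-envelope arithmetic only: it assumes the data is spread evenly across shards and ignores indexes, the operating system, and any replication overhead.

```javascript
const shardCount = 100;
const ramPerShardGB = 61;
const dataSetGB = 4000; // 4 TB, counted in decimal units (assumption)

const totalRamGB = shardCount * ramPerShardGB; // RAM across the whole cluster
const dataPerShardGB = dataSetGB / shardCount; // assumes an even distribution
const fitsInRam = dataPerShardGB < ramPerShardGB;

console.log(totalRamGB, dataPerShardGB, fitsInRam); // 6100 40 true
```

Under these assumptions each shard holds about 40 GB of data against 61 GB of RAM, which leaves some headroom for indexes.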
Cost per year in AWS?
Single server: $60,000 / yr
Cluster: $700,000 / yr...
Performance: single time and place
db.data.find({"st" : "u747940", "ts" : ISODate("1969-07-16T12:00:00Z")})
[Chart: latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis from 0 to 2 ms]
Max. throughput: single server 40,000/s, cluster 610,000/s (10 mongos)
Performance: 1 year's weather
db.data.find({"st" : "u747940", "ts" : {"$gte": ISODate("1989-01-01"), "$lt" : ISODate("1990-01-01")}})
[Chart: targeted query latency in ms (avg, 95th, 99th percentile), single server vs. cluster, axis from 0 to 5000 ms]
Max. throughput: single server 20/s, cluster 430/s (10 mongos)
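The one-year query uses a half-open range: `$gte` the first instant of the year and `$lt` the first instant of the next, so no reading is missed or counted twice at the boundary. The sketch below builds that filter and applies it to plain objects. The helper names `oneYearFilter` and `matches` and the sample readings are hypothetical, and `matches` only handles this one filter shape; it is not a general query engine.

```javascript
// Build the same filter shape as the query above for any station and year.
function oneYearFilter(station, year) {
  return {
    st: station,
    ts: {
      $gte: new Date(Date.UTC(year, 0, 1)),    // inclusive start
      $lt: new Date(Date.UTC(year + 1, 0, 1)), // exclusive end
    },
  };
}

// In-memory check for this specific filter shape only.
function matches(doc, filter) {
  return doc.st === filter.st && doc.ts >= filter.ts.$gte && doc.ts < filter.ts.$lt;
}

// Hypothetical readings, invented for illustration.
const readings = [
  { st: "u747940", ts: new Date("1988-12-31T23:59:00Z") }, // just before the range
  { st: "u747940", ts: new Date("1989-01-01T00:00:00Z") }, // first instant, included
  { st: "u747940", ts: new Date("1989-07-16T12:00:00Z") }, // inside the range
  { st: "u747940", ts: new Date("1990-01-01T00:00:00Z") }, // end bound, excluded
  { st: "u725053", ts: new Date("1989-07-16T12:00:00Z") }, // other station
];

const filter = oneYearFilter("u747940", 1989);
const year1989 = readings.filter((r) => matches(r, filter));

console.log(year1989.length); // 2
```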
Analytics
db.data.aggregate([
  { "$match" : { "airtemperature.quality" : { "$in" : [ "1", "5" ] } } },
  { "$group" : { "_id" : null, "maxtemp" : { "$max" : "$airtemperature.value" } } }
])
Result: 61.8 C = 143 F
Single server: 4 h 45 min
Cluster: 2 min (142x faster)
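To show what the pipeline computes, here is an in-memory equivalent of the two stages in plain JavaScript. It is a sketch of the logic, not of how MongoDB executes it; `maxTemperature`, `toFahrenheit`, and the sample records are hypothetical. The last lines re-check the temperature conversion and the speedup reported above.

```javascript
// Hypothetical sample records, invented for illustration.
const sample = [
  { st: "u725053", airtemperature: { value: 21.1, quality: "5" } },
  { st: "u725053", airtemperature: { value: 99.9, quality: "9" } }, // rejected by quality
  { st: "u747940", airtemperature: { value: 30.4, quality: "1" } },
  { st: "u747940", airtemperature: { value: -5.0, quality: "5" } },
];

function maxTemperature(records, acceptedQualities) {
  const accepted = new Set(acceptedQualities);
  let maxtemp = null;
  for (const r of records) {
    // $match stage: keep only readings with an accepted quality code.
    if (!accepted.has(r.airtemperature.quality)) continue;
    // $group stage with $max: one group (_id: null) over all kept readings.
    const v = r.airtemperature.value;
    if (maxtemp === null || v > maxtemp) maxtemp = v;
  }
  return { _id: null, maxtemp };
}

const result = maxTemperature(sample, ["1", "5"]);
console.log(result.maxtemp); // 30.4

// Re-checking the reported numbers.
const toFahrenheit = (celsius) => (celsius * 9) / 5 + 32;
const reportedMaxF = Math.round(toFahrenheit(61.8)); // 143
const singleServerMinutes = 4 * 60 + 45;             // 4 h 45 min
const speedup = Math.floor(singleServerMinutes / 2); // 142
console.log(reportedMaxF, speedup);
```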
Get the code and data https://github.com/mikeckennedy/sdd2016
Want to go deeper?
talkpython.fm
training.talkpython.fm
michaelckennedy.net
mikeckennedy@gmail.com
@mkennedy