Architecture of a Real-Time Operational DBMS Srini V. Srinivasan Founder, Chief Development Officer Aerospike CMG India Keynote Thane December 3, 2016 [ CMGI Keynote, Thane, India. 2016 Aerospike Inc. All rights reserved. 1 ]
Real-Time Workloads
Reliability at Massive Scale

Developments:
- Internet growth: high transaction rates (millions/second)
- Storage advances: expansion of DRAM, rise of SSDs
- New distributed-systems results, e.g., the CAP theorem and consensus algorithms such as Paxos
- Application developers prefer not using SQL (Python, Go, PHP, ...)

Traditional DBs:
- Guarantee strong consistency for replicated data
- Limited in scalability and availability; cannot handle network partitions

NoSQL DBs:
- Focus on massive scalability and high availability
- Use concepts from operating systems and distributed systems
SQL → NoSQL

SQL databases are architected for:
- Disk-oriented storage and indexing structures
- Multithreading to hide latency
- Locking-based concurrency control mechanisms
- Log-based recovery

NoSQL and NewSQL databases are architected for:
- In-memory operation
- Incremental upgrades (no fork-lift upgrade)
- High availability
- Self-management (self-healing, self-maintaining, self-tuning)
Next Generation of Database Systems: Speed at Scale

[Chart: TCO ($) vs. scale (TB) — for the real-time problem set, observed TCO rises steeply with scale; the desired TCO curve stays flat and affordable. Rich-functionality, non-real-time use cases sit at lower speed (TPS). TCO = total cost of ownership.]

Goal: deliver predictable performance, highest availability, and lowest TCO.
Use Cases
Billion-Dollar Advertising Market Uses RTB
- 1 to 6 billion cookies tracked
- Auctions at about 3.0M/sec in North America
- 100 ms budget for ad rendering; 50 ms for real-time bidding
- Requires low latency, high throughput, high uptime
RTB Tech Stack
Beyond Ad Tech: FinServ, Marketing Tech, Telco, AdTech, Gaming
Financial Services Tech Stack
Fraud Detection Tech Stack
Telco Tech Stack
Operational Scale in Enterprises

[Diagram: a decisioning engine sits between business transactions (payments, mobile queries, recommendations, and more) and back-end systems — a legacy RDBMS/mainframe and an HDFS-based data warehouse/data lake — with a high-performance NoSQL store (with XDR) serving real-time big-data decisioning.]

500 business transactions per second x 5,000 database operations each = 2.5M calculations per second.
Technology
Architecture Overview
1) No hotspots: distributed hashing simplifies data partitioning
2) Smart client: 1 hop to data, load balancing
3) Shared-nothing architecture: every node is identical
4) Smart clustering: auto-sharding, auto-failover, auto-rebalancing, rack aware, rolling upgrades
5) Transactions and long-running tasks prioritized in real time
6) XDR: replication across data centers ensures near-zero downtime
Cluster Formation

Say N1 is the seed node and N3 is the Paxos principal.
1. N2 and N3 each send their own node list to N1; N1 discovers them.
2. N1 sends the adjacency list [N1, N2, N3] to the newly discovered nodes N3 and N2.
3. N3 discovers N2 and starts sending the cluster node list [N3, N2, N1] to N1 and N2.
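The discovery steps above can be sketched as nodes exchanging and merging adjacency lists until everyone sees the same membership. This is a toy model with hypothetical names (`Node`, `heartbeat`), not Aerospike's actual gossip implementation:

```python
class Node:
    """Toy model of cluster discovery via adjacency-list exchange."""

    def __init__(self, name: str):
        self.name = name
        self.adjacency = {name}  # every node starts knowing only itself

    def receive(self, node_list):
        # Merge a peer's adjacency list into our own view of the cluster.
        self.adjacency |= set(node_list)

    def heartbeat(self, peers):
        # Send our current adjacency list to each known peer.
        for p in peers:
            p.receive(self.adjacency)


# Replay the slide's scenario: N1 is the seed node.
n1, n2, n3 = Node("N1"), Node("N2"), Node("N3")
n2.heartbeat([n1])        # N2 sends itself to N1; N1 discovers N2
n3.heartbeat([n1])        # N3 sends itself to N1; N1 discovers N3
n1.heartbeat([n2, n3])    # N1 sends [N1, N2, N3] to the newly discovered nodes
n3.heartbeat([n1, n2])    # N3 now knows everyone and gossips the full list
```

After these exchanges all three nodes hold the same adjacency set {N1, N2, N3}, at which point a principal can be chosen deterministically (e.g., by node ID).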
Distributed Hash-Based Partitioning

Distributed hashing with no hotspots:
- Every key is hashed with RIPEMD-160 into an efficient 20-byte (fixed-length) digest
- The hash plus additional data (a fixed 64 bytes) forms the index entry in RAM
- Some bits of the hash value (12 bits for 4096 partitions) are used to calculate the partition ID
- The partition ID maps to a node ID in the cluster
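A minimal sketch of the key-to-partition mapping described above. The slide specifies RIPEMD-160; SHA-1 (also a 20-byte digest) is used here only because RIPEMD-160 is absent from some `hashlib`/OpenSSL builds, and which digest bits are taken is my assumption:

```python
import hashlib

NUM_PARTITIONS = 4096  # 2^12, as on the slide

def partition_id(key: bytes) -> int:
    """Hash the key to a fixed 20-byte digest, then derive a 12-bit partition ID."""
    digest = hashlib.sha1(key).digest()  # stand-in for RIPEMD-160 (both 20 bytes)
    # Take 12 bits from the digest to select one of 4096 partitions.
    return int.from_bytes(digest[:2], "little") % NUM_PARTITIONS
```

Because the digest is uniformly distributed, keys spread evenly across the 4096 partitions regardless of key skew — that is what eliminates hotspots.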
Data Distribution

Index and data are colocated.

Goals:
1. Distribute the workload uniformly
2. Provide predictable read/write performance
3. Scale up and down by simply adding or removing cluster nodes
4. Rebalance data non-disruptively and efficiently

Partition assignment objectives:
1. Deterministic, so each node can operate by itself
2. Uniform distribution of partitions across nodes
3. Minimize partition moves during cluster changes
Partition Assignment Algorithm

function REPLICATION_LIST_ASSIGN(partitionid):
    node_hash = empty map
    for nodeid in succession_list:
        node_hash[nodeid] = NODE_HASH_COMPUTE(nodeid, partitionid)
    replication_list = sort_ascending(node_hash using hash)
    return replication_list

function NODE_HASH_COMPUTE(nodeid, partitionid):
    nodeid_hash = fnv_1a_hash(nodeid)
    partition_hash = fnv_1a_hash(partitionid)
    return jenkins_one_at_a_time_hash(<nodeid_hash, partition_hash>)
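A runnable Python rendering of the pseudocode above. The byte encodings fed to the hash functions are my assumption (the slide does not specify them), but the structure — FNV-1a per input, Jenkins over the pair, nodes sorted by hash — follows the pseudocode:

```python
def fnv_1a_hash(data: bytes) -> int:
    """64-bit FNV-1a."""
    h = 0xCBF29CE484222325
    for b in data:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def jenkins_one_at_a_time_hash(data: bytes) -> int:
    """Jenkins one-at-a-time, 32-bit."""
    h = 0
    for b in data:
        h = (h + b) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h

def node_hash_compute(nodeid: str, partitionid: int) -> int:
    nodeid_hash = fnv_1a_hash(nodeid.encode())
    partition_hash = fnv_1a_hash(partitionid.to_bytes(4, "big"))
    return jenkins_one_at_a_time_hash(
        nodeid_hash.to_bytes(8, "big") + partition_hash.to_bytes(8, "big"))

def replication_list_assign(partitionid: int, succession_list: list) -> list:
    # Sort nodes by their per-partition hash: index 0 is the master, the
    # next replication_factor - 1 entries hold the replicas.
    return sorted(succession_list,
                  key=lambda nodeid: node_hash_compute(nodeid, partitionid))
```

Note how this meets the three objectives: the list is a pure function of node IDs and partition ID (deterministic), the hashes spread partitions uniformly, and removing one node leaves the relative order of the remaining nodes unchanged, so only partitions owned by the departed node move.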
Real-Time Prioritization

Writing with immediate consistency:
1. Write sent to the row master
2. Latch against simultaneous writes
3. Apply the write to master and replica synchronously
4. Queue operations to disk
5. Signal completed transaction
6. Master merges duplicate copies (if any)

Adding a node (transactions continue):
1. Cluster discovers the new node via the gossip protocol
2. Paxos vote determines the new data organization
3. Partition migrations are scheduled (only deltas copied)
4. When a partition migration starts, a write journal starts on the destination
5. Partition moves atomically
6. Journal is applied and source data deleted
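The immediate-consistency write path can be sketched in a few lines. This is a single-process toy (class and method names are hypothetical, and the replica call here is an in-process method rather than a network RPC), but it mirrors the ordering on the slide: latch, apply to master and replica, queue the disk write, signal completion:

```python
import threading
import queue

class Node:
    """Stand-in for a cluster node holding one partition's records."""
    def __init__(self, name: str):
        self.name = name
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value

class MasterPartition:
    """Toy write path for a partition with one master and one replica."""
    def __init__(self, master: Node, replica: Node):
        self.master, self.replica = master, replica
        self.row_locks = {}                 # per-row latches
        self.locks_guard = threading.Lock()
        self.disk_queue = queue.Queue()     # drained by a background flusher

    def write(self, key, value) -> bool:
        with self.locks_guard:
            latch = self.row_locks.setdefault(key, threading.Lock())
        with latch:                              # 2. latch against simultaneous writes
            self.master.apply(key, value)        # 3. apply to master...
            self.replica.apply(key, value)       #    ...and replica synchronously
            self.disk_queue.put((key, value))    # 4. queue the operation to disk
        return True                              # 5. signal completed transaction
```

Because the replica is updated before the client is acknowledged, a read served by either copy after the write completes sees the new value — that is the "immediate consistency" the slide refers to.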
Intelligent Client

The Aerospike client is implemented as a library (JAR or DLL) and consists of two parts:
- Operation APIs: the operations you can execute on the cluster (CRUD+, etc.)
- First-class observer of the cluster: monitors the state of each node and is aware of new nodes and node failures

1 hop to data: the smart client simply calculates the partition ID to determine the node ID. The client also performs load balancing.
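A minimal sketch of the 1-hop routing idea: the client holds a partition map (partition ID → node address) learned from the cluster, so every request goes straight to the owning node with no proxy in between. Names (`SmartClient`, `node_for`) are illustrative, not the real client API, and SHA-1 stands in for RIPEMD-160:

```python
import hashlib

NUM_PARTITIONS = 4096

class SmartClient:
    """Toy client that routes each key directly to its owning node."""

    def __init__(self, partition_map):
        # partition_map: node address per partition ID, refreshed whenever
        # the client observes a cluster change.
        self.partition_map = partition_map

    def node_for(self, key: bytes) -> str:
        # Same hash-to-partition step the server uses, computed client-side.
        digest = hashlib.sha1(key).digest()
        pid = int.from_bytes(digest[:2], "little") % NUM_PARTITIONS
        return self.partition_map[pid]
```

Since the map is computed locally, routing costs one table lookup per request instead of an extra network hop, and load balancing falls out of the uniform hash.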
Designed for Wire-Line Speed

Multi-core architecture; optimized C-based DB kernel:
1. Multi-threaded data structures
2. Nested locking model for synchronization
3. Lockless data structures
4. Partitioned single-threaded data structures
5. Index entries aligned to the cache line (64 bytes)
6. Custom memory management (arenas)

[Figure: memory arena assignment]
In-Memory and Hybrid-Memory Storage Architecture

Highlights:
1. Direct device access
2. Large-block writes
3. Indexes in DRAM
4. Highly parallelized
5. Log-structured file system (copy-on-write)
6. Fast restart with shared memory

[Figure: storage layout]
Benchmarks
Hybrid-Memory Performance

[Charts — high throughput: throughput (ops/sec, up to ~350,000) for balanced and read-heavy workloads, Aerospike vs. Cassandra; low latency: average read latency (ms, 0–10) vs. throughput (0–200,000 ops/sec).]

Balanced workload: 50/50 read/write ratio. Read-heavy workload: 95/5 read/write ratio.
In-Memory Performance (YCSB Benchmark)
- 50 million records
- YCSB Workload A (50/50 R/W) and Workload B (95/5 R/W)
- Zipfian key distribution
- 8-core, dual-socket Intel Xeon E5-2665 @ 2.4 GHz, 32 GB DRAM, 16 queues
DRAM vs. SSD on GCE
- GCE instance: n1-standard-8
- 10-node cluster
- 150-byte records with 3 columns
- 100 million records
Predictable Performance During Failures

Phases:
1. 100K TPS, 4 nodes
2. Clients at max
3. 400K TPS, 4 nodes
4. 400K TPS, 3 nodes
5. 400K TPS, 4 nodes

Aerospike node specs: CentOS 6.3, Intel i5-2400 @ 3.1 GHz (quad-core), 16 GB RAM @ 1333 MHz.
TCO: In-Memory vs. Hybrid-Memory

Actual deployment analysis. The deployment requires 500K TPS and 10 TB of storage with a 2x replication factor.

                              In-Memory (186 servers)    Hybrid-Memory (14 servers)
  Storage per server          180 GB (196 GB server)     2.4 TB (4 x 700 GB)
  TPS per cluster             500,000                    500,000
  Cost per server             $8,000                     $11,000
  Server costs                $1,488,000                 $154,000
  Power per server            0.9 kW                     1.1 kW
  Power, 2 years
    ($0.12/kWh avg. US)       $352,000                   $32,400
  Maintenance, 2 years
    ($3,600 per server)       $670,000                   $50,400
  Total                       $2,510,000                 $236,800
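The line items above can be recomputed from the per-server figures; a short sketch of that arithmetic (the slide rounds each line, so the checks below compare approximately):

```python
def tco(servers: int, cost_per_server: float, kw_per_server: float,
        rate: float = 0.12, years: int = 2, maint_per_server: float = 3600):
    """Return (server cost, power cost, maintenance cost) over the given period."""
    hours = years * 365 * 24                           # 17,520 hours in 2 years
    server_cost = servers * cost_per_server
    power_cost = servers * kw_per_server * hours * rate
    maint_cost = servers * maint_per_server
    return server_cost, power_cost, maint_cost

in_mem = tco(186, 8000, 0.9)    # ≈ (1,488,000; 351,942; 669,600) → total ≈ $2.51M
hybrid = tco(14, 11000, 1.1)    # ≈ (154,000; 32,377; 50,400)     → total ≈ $237K
```

The roughly 10x cost gap comes almost entirely from the server count: hybrid-memory nodes pack ~13x more usable storage each, so 14 servers replace 186.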
Future Work

Software:
- Linearizability with a CP mode
- Eventual consistency with conflict detection and resolution
- Pipelined execution of client transactions for increased performance
- Security enhancements

Application requirements — customers demand:
- Real-time decisions based on recent data
- High consistency
- Security

New hardware:
- 3D XPoint
- High-core-count CPUs
- NVMe
- Multi-queue network cards
- Virtualized I/O
Thank You

Questions?