CockroachDB on DC/OS Ben Darnell, CTO, Cockroach Labs
Agenda A cloud-native database CockroachDB on DC/OS Why CockroachDB Demo!
Cloud-Native Database
What is Cloud-Native? Horizontally scalable Individual machines aren't special Built to handle failures
Why? Adapt to Change Grow and shrink based on demand Easy, rapid deployment Reorganize to reduce load and latency Load follows the sun
Traditional Databases More comfortable scaling vertically than horizontally Primaries vs secondaries vs read-only replicas Failover is often manual Asynchronous replication may lose data
Cloud-Native CockroachDB Add nodes with automatic rebalancing Any node can serve any role Automatic failover with consistent replication
CockroachDB on DC/OS
DC/OS Provides an Ideal Environment Elastic allocation and scheduling Discovery, VIPs, and load balancing Weakly persistent local storage
Storage Dilemma RAID or network-attached storage is expensive CockroachDB supplies its own redundancy Naive container deployments could lose all local volumes at once DC/OS SDK provides persistent scheduling
DC/OS Package dcos package install cockroachdb Starts a 3-node cluster Add more nodes later by changing the NODE_COUNT variable Or one-click install through the web UI
Why CockroachDB? An open source SQL database for global cloud services.
Why CockroachDB? Distributed SQL Data integrity at global scale Multi-active availability Consistent transactions
Distributed SQL A database that grows with your application
What is Distributed SQL? SQL with ACID semantics Runs across multiple database servers Distributed storage Distributed compute
Distributed Storage Tables scale to any size No manual partitioning necessary Automatic rebalancing as machines are added
Distributed Execution Nodes with data compute their portion of results SELECT SUM(duration) FROM sessions
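As a hedged illustration of issuing that query from a client: CockroachDB speaks the PostgreSQL wire protocol, so Go's standard database/sql with the lib/pq driver works. The connection string, database name, and sessions table below are assumptions for illustration only.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Hypothetical address; 26257 is CockroachDB's default SQL port.
	db, err := sql.Open("postgres",
		"postgresql://root@cockroachdb.example.internal:26257/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The aggregate is distributed: each node sums the session rows it
	// stores locally, and the gateway node combines the partial results.
	var total float64
	if err := db.QueryRow("SELECT SUM(duration) FROM sessions").Scan(&total); err != nil {
		log.Fatal(err)
	}
	fmt.Println("total duration:", total)
}

Any node can act as the SQL gateway here; the client does not need to know where the rows live.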
Flexible Scheduling Schedule CockroachDB nodes... near your application servers for lower latency, or on your application nodes when you have underutilized disks
Global Distribution Span three datacenters to survive major outages Put data close to your customers
Multi-Active Availability Survive disasters
Disaster Recovery, Version 1.0 Microservices in DC1, with a remote backup in DC2
Recovery 2.0: Active/Passive A primary in DC1 replicating asynchronously to a secondary in DC2
Recovery 3.0: Active/Active Live deployments in both DC1 and DC2
Multi-Active Availability Consensus replication across DC1, DC2, and DC3
Consistent Transactions Data Integrity at Global Scale
Consensus Replication A majority of replicas must agree to commit Failovers never lose committed data Raft consensus algorithm
Distributed Transactions Transactions can span rows, tables, even databases ACID semantics Serializable isolation
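A minimal sketch of what such a transaction looks like from a Go client through database/sql; the accounts and transfers tables are hypothetical, and CockroachDB runs the transaction at serializable isolation by default.

import "database/sql"

// transfer moves money between two accounts and records the transfer,
// touching multiple rows (and potentially multiple ranges and nodes).
// Hypothetical schema, shown only to illustrate ACID semantics.
func transfer(db *sql.DB, from, to string, amount int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from); err != nil {
		tx.Rollback()
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to); err != nil {
		tx.Rollback()
		return err
	}
	if _, err := tx.Exec(
		"INSERT INTO transfers (src, dst, amount) VALUES ($1, $2, $3)", from, to, amount); err != nil {
		tx.Rollback()
		return err
	}
	// Either all three statements take effect or none do.
	return tx.Commit()
}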
Demo
Status
Current Version: 1.0.6 First production-ready release in May 2017 Distributed SQL Multi-active availability Enterprise: Incremental, distributed backup/restore
Very Soon: 1.1 Ruggedization Diagnostic & debugging tools Performance SQL coverage Fast data import (CSV)
Up Next Performance SQL feature parity (JSONB) Change data capture (via Kafka) Global data architecture Enterprise: row-level partitioning
The End https://www.cockroachlabs.com https://github.com/cockroachdb/cockroach https://github.com/dcos/examples https://gitter.im/cockroachdb/cockroach
Appendix
Target Customers Is CockroachDB right for you?
Fastest growing companies in the world are using CockroachDB to solve their global data challenges: a $2B+ European gaming company that uses CockroachDB to improve performance and availability for its global customer base; a $60B+ web services company that uses CockroachDB to power its knowledge graph; a real-time gaming platform that uses CockroachDB as the enabling technology to deliver its PaaS
Is CockroachDB right for you? Global, scalable cloud services We hope this eventually means everyone Not for: Latency-sensitive writes Unstructured data Efficient OLAP processing (for now) Analytics in a non-SQL language (MapReduce)
Correctness or Performance? "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions." Authors of Google's Spanner paper
CockroachDB Internals Thorax, wings, eyes, legs,...
Layered architecture SQL Transactional KV Distribution Replication Storage
Data Distribution SQL Transactional KV Distribution Replication Storage
Data Distribution Data is a large collection of Key:Value pairs Data split into Ranges of Key:Value pairs Ranges are ~64 MB of data CockroachDB/Bigtable/HBase/Spanner
Data Distribution Options for finding data Hashing Order-Preserving
Order-preserving data distribution Pro: efficient scans over total ordering of data Con: requires additional indexing
Keyspace as Monolithic, Sorted Map
Ranges Form the Keyspace
Ranges Map to Nodes
Data Lookup
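A deliberately simplified sketch (not CockroachDB's actual code) of order-preserving lookup over the ranges described above: range descriptors are kept sorted by start key, and finding the range for a key is a binary search over that ordering.

import "sort"

// Illustrative only. Descriptors are sorted by StartKey, the first StartKey
// is the empty string, and together they cover the whole keyspace.
type rangeDescriptor struct {
	StartKey, EndKey string // half-open key interval [StartKey, EndKey)
	Replicas         []int  // node IDs holding replicas of this range
}

// lookupRange returns the descriptor whose interval contains key.
func lookupRange(descs []rangeDescriptor, key string) rangeDescriptor {
	// Find the first descriptor that starts after key, then step back one.
	i := sort.Search(len(descs), func(i int) bool {
		return descs[i].StartKey > key
	})
	return descs[i-1]
}

Because the ordering is preserved, a scan over a key span just walks consecutive descriptors (the "efficient scans" pro); the cost is maintaining this extra indexing as ranges split and move.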
Consensus SQL Transactional KV Distribution Replication Storage
Consensus Replication Replicate to N (where N >= 3) nodes Commit happens when a quorum has written the data CockroachDB/etcd/Spanner/Aurora/ZooKeeper/...
Consensus Replication Raft is our consensus protocol of choice. Run one consensus group per range of data
Consensus Replication (diagram walkthrough)
A client sends Put cherry to the Leader of a 3-replica range.
The Leader writes cherry locally and forwards Put cherry to both Followers.
The write is committed when 2 out of 3 nodes have written the data.
The Followers acknowledge, and the remaining Follower catches up with cherry.
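Boiled down to a hedged Go sketch, the sequence above is quorum counting: the leader's own durable write plus enough follower acknowledgements make a majority. Real Raft also tracks terms, log indexes, and leader election, all omitted here.

// Simplified quorum write; not the real Raft implementation.
func replicate(entry string, followers []chan<- string, acks <-chan bool) bool {
	votes := 1 // the leader's own durable write counts as one vote
	for _, f := range followers {
		f <- entry // send the entry to every follower
	}
	needed := (1+len(followers))/2 + 1 // majority of all replicas
	for range followers {
		if <-acks {
			votes++
		}
		if votes >= needed {
			return true // committed: a quorum has written the entry
		}
	}
	return false // no quorum; the write is not committed
}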
Consensus Replication What happens during failover?
Consensus Replication: Failover (diagram walkthrough)
The Leader accepts Put cherry but fails before the write reaches a quorum.
The surviving Followers elect a new Leader.
On failover, only data written to a quorum is considered present.
A subsequent Read cherry therefore returns: key not found.
Consensus Replication Consensus provides atomic writes But only for each range What about operations that hit multiple ranges?
Distributed Transactions SQL Transactional KV Distribution Replication Storage
Transactions Supports ACID semantics All-or-nothing Serializable [default] & Snapshot isolation levels
Distributed Transactions Transaction record Keyed by a transaction UUID Located on the first range written Atomic commit or abort via a write to the txn record One-phase commit fast path
Distributed Transactions (diagram walkthrough)
1. Begin Txn 1, Put cherry: a Txn-1 record is created with status PENDING, and cherry is written as an intent (txn 1) on its range.
2. Put mango: a second intent (txn 1) is written on a different range.
3. Commit Txn 1: the Txn-1 record's status flips to COMMITTED.
4. Clean up intents: the intents are rewritten as ordinary values.
5. Remove Txn 1: the transaction record is deleted.
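In sketch form (illustrative Go types, not CockroachDB's internals), the walkthrough above amounts to: provisional writes are intents tagged with a transaction ID, commit is a single status flip on the transaction record, and cleanup later rewrites intents as ordinary values.

// Illustrative only.
type TxnStatus string

const (
	Pending   TxnStatus = "PENDING"
	Committed TxnStatus = "COMMITTED"
	Aborted   TxnStatus = "ABORTED"
)

// The transaction record is keyed by the transaction's UUID and lives on
// the first range the transaction writes to.
type TxnRecord struct {
	ID     string
	Status TxnStatus
}

// A write intent is a provisional value tagged with its transaction's ID.
type Intent struct {
	Key, Value, TxnID string
}

// Committing is one write: flip the record's status. Every intent the
// transaction wrote, on any range, becomes logically committed at once.
func commitTxn(rec *TxnRecord) { rec.Status = Committed }

// Cleanup rewrites each intent as an ordinary value; the transaction
// record can then be removed.
func cleanup(kv map[string]string, intents []Intent) {
	for _, in := range intents {
		kv[in.Key] = in.Value
	}
}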
Distributed Transactions That's the happy case What about conflicting transactions?
Distributed Transactions (read conflict, diagram walkthrough)
Txn 1 has written cherry as an intent; its record is PENDING.
Txn 2 begins and issues Get cherry, finds the intent, and checks the Txn-1 record.
Txn 2 waits while the Txn-1 record is PENDING.
When Txn 1 commits, its record becomes COMMITTED; Txn 2 then reads cherry and commits.
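A hedged sketch of that read path: a reader that finds an intent consults the writer's transaction record, waits while it is pending, and then either sees the committed value or ignores the aborted intent. The function parameters are illustrative, not CockroachDB's API.

// Illustrative read of a key that may be covered by a write intent.
// txnStatus reports the intent owner's record: "PENDING", "COMMITTED", or "ABORTED".
func readKey(committedValue, intentValue string, hasIntent bool,
	txnStatus func() string, wait func()) string {
	if !hasIntent {
		return committedValue // no conflict
	}
	for txnStatus() == "PENDING" {
		wait() // block until the writing transaction resolves
	}
	if txnStatus() == "COMMITTED" {
		return intentValue // the writer won; its value is now visible
	}
	return committedValue // the writer aborted; its intent is ignored
}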
Distributed Transactions What about write / write conflicts?
Distributed Transactions (write conflict, diagram walkthrough)
Txn 1 has written cherry as an intent; its record is PENDING.
Txn 2 begins and issues its own Put cherry, creating a Txn-2 record (PENDING).
Txn 2 finds Txn 1's intent, checks the Txn-1 record, and waits while it is PENDING.
Once Txn 1 commits, Txn 2 resolves the committed cherry (txn 1) intent and writes its own intent, cherry (txn 2).
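Some conflicts force a transaction to restart; CockroachDB reports these to the client as retryable errors (SQLSTATE 40001 over the PostgreSQL wire protocol), so applications typically rerun the whole transaction. A hedged sketch with the lib/pq driver:

import "github.com/lib/pq"

// retryTxn reruns fn while CockroachDB reports a retryable transaction
// error (code 40001). Sketch only: real code would add backoff and a context.
func retryTxn(fn func() error) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
			continue // serialization conflict: safe to rerun the transaction
		}
		return err // anything else is not retryable
	}
	return err
}

For example, retryTxn(func() error { return transfer(db, "alice", "bob", 10) }) reruns the earlier hypothetical transfer; the cockroach-go helper library offers a similar wrapper (crdb.ExecuteTx).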