CockroachDB on DC/OS Ben Darnell, CTO, Cockroach Labs
Agenda A cloud-native database CockroachDB on DC/OS Why CockroachDB Demo!
Cloud-Native Database
What is Cloud-Native? Horizontally scalable Individual machines aren't special Built to handle failures
Why? Adapt to Change Grow and shrink based on demand Easy, rapid deployment Reorganize to reduce load and latency Load follows the sun
Traditional Databases More comfortable scaling vertically than horizontally Primaries vs secondaries vs read-only replicas Failover is often manual Asynchronous replication may lose data
Cloud-Native CockroachDB Add nodes with automatic rebalancing Any node can serve any role Automatic failover with consistent replication
CockroachDB on DC/OS
DC/OS Provides an Ideal Environment Elastic allocation and scheduling Discovery, VIPs, and load balancing Weakly persistent local storage
Storage Dilemma RAID or network-attached storage is expensive CockroachDB supplies its own redundancy Naive container deployments could lose all local volumes at once DC/OS SDK provides persistent scheduling
DC/OS Package dcos package install cockroachdb Starts a 3-node cluster Add more nodes later by changing the NODE_COUNT variable Or one-click install through the web UI
Why CockroachDB? An open source SQL database for global cloud services.
Why CockroachDB? Distributed SQL Data integrity at global scale Multi-active availability Consistent transactions
Distributed SQL A database that grows with your application
What is Distributed SQL? SQL with ACID semantics Runs across multiple database servers Distributed storage Distributed compute
Distributed Storage Tables scale to any size No manual partitioning necessary Automatic rebalancing as machines are added
Distributed Execution Nodes with data compute their portion of results SELECT SUM(duration) FROM sessions
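As a hedged illustration of issuing that query from a client: CockroachDB speaks the PostgreSQL wire protocol, so Go's standard database/sql with the lib/pq driver works. The connection string, database name, and sessions table below are assumptions for illustration only.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	// Hypothetical address; 26257 is CockroachDB's default SQL port.
	db, err := sql.Open("postgres",
		"postgresql://root@cockroachdb.example.internal:26257/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The aggregate is distributed: each node sums the session rows it
	// stores locally, and the gateway node combines the partial results.
	var total float64
	if err := db.QueryRow("SELECT SUM(duration) FROM sessions").Scan(&total); err != nil {
		log.Fatal(err)
	}
	fmt.Println("total duration:", total)
}

Any node can act as the SQL gateway here; the client does not need to know where the rows live.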
Flexible Scheduling Schedule CockroachDB nodes... near your application servers for lower latency, or on your application nodes when you have underutilized disks
Global Distribution Span three datacenters to survive major outages Put data close to your customers
Multi-Active Availability Survive disasters
Disaster Recovery, Version 1.0 Microservices in DC1, with a remote backup in DC2
Recovery 2.0: Active/Passive A primary in DC1 replicating asynchronously to a secondary in DC2
Recovery 3.0: Active/Active Live deployments in both DC1 and DC2
Multi-Active Availability Consensus replication across DC1, DC2, and DC3
Consistent Transactions Data Integrity at Global Scale
Consensus Replication A majority of replicas must agree to commit Failovers never lose committed data Raft consensus algorithm
Distributed Transactions Transactions can span rows, tables, even databases ACID semantics Serializable isolation
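A minimal sketch of what such a transaction looks like from a Go client through database/sql; the accounts and transfers tables are hypothetical, and CockroachDB runs the transaction at serializable isolation by default.

import "database/sql"

// transfer moves money between two accounts and records the transfer,
// touching multiple rows (and potentially multiple ranges and nodes).
// Hypothetical schema, shown only to illustrate ACID semantics.
func transfer(db *sql.DB, from, to string, amount int) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from); err != nil {
		tx.Rollback()
		return err
	}
	if _, err := tx.Exec(
		"UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to); err != nil {
		tx.Rollback()
		return err
	}
	if _, err := tx.Exec(
		"INSERT INTO transfers (src, dst, amount) VALUES ($1, $2, $3)", from, to, amount); err != nil {
		tx.Rollback()
		return err
	}
	// Either all three statements take effect or none do.
	return tx.Commit()
}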
Demo
Status
Current Version: 1.0.6 First production-ready release in May 2017 Distributed SQL Multi-active availability Enterprise: Incremental, distributed backup/restore
Very Soon: 1.1 Ruggedization Diagnostic & debugging tools Performance SQL coverage Fast data import (CSV)
Up Next Performance SQL feature parity (JSONB) Change data capture (via Kafka) Global data architecture Enterprise: row-level partitioning
The End https://www.cockroachlabs.com https://github.com/cockroachdb/cockroach https://github.com/dcos/examples https://gitter.im/cockroachdb/cockroach
Appendix
Target Customers Is CockroachDB right for you?
Fastest growing companies in the world are using CockroachDB to solve their global data challenges: a $2B+ European gaming company that uses CockroachDB to improve performance and availability for its global customer base; a $60B+ web services company that uses CockroachDB to power its knowledge graph; a real-time gaming platform that uses CockroachDB as the enabling technology to deliver its PaaS
Is CockroachDB right for you? Global, scalable cloud services We hope this eventually means everyone Not for: Latency-sensitive writes Unstructured data Efficient OLAP processing (for now) Analytics in a non-SQL language (MapReduce)
Correctness or Performance? "We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions." Authors of Google's Spanner paper
CockroachDB Internals Thorax, wings, eyes, legs,...
Layered architecture SQL Transactional KV Distribution Replication Storage
Data Distribution SQL Transactional KV Distribution Replication Storage
Data Distribution Data is a large collection of Key:Value pairs Data split into Ranges of Key:Value pairs Ranges are ~64 MB of data CockroachDB/Bigtable/HBase/Spanner
Data Distribution Options for finding data Hashing Order-Preserving
Order-preserving data distribution Pro: efficient scans over total ordering of data Con: requires additional indexing
Keyspace as Monolithic, Sorted Map
Ranges Form the Keyspace
Ranges Map to Nodes
Data Lookup
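A deliberately simplified sketch (not CockroachDB's actual code) of order-preserving lookup over the ranges described above: range descriptors are kept sorted by start key, and finding the range for a key is a binary search over that ordering.

import "sort"

// Illustrative only. Descriptors are sorted by StartKey, the first StartKey
// is the empty string, and together they cover the whole keyspace.
type rangeDescriptor struct {
	StartKey, EndKey string // half-open key interval [StartKey, EndKey)
	Replicas         []int  // node IDs holding replicas of this range
}

// lookupRange returns the descriptor whose interval contains key.
func lookupRange(descs []rangeDescriptor, key string) rangeDescriptor {
	// Find the first descriptor that starts after key, then step back one.
	i := sort.Search(len(descs), func(i int) bool {
		return descs[i].StartKey > key
	})
	return descs[i-1]
}

Because the ordering is preserved, a scan over a key span just walks consecutive descriptors (the "efficient scans" pro); the cost is maintaining this extra indexing as ranges split and move.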
Consensus SQL Transactional KV Distribution Replication Storage
Consensus Replication Replicate to N (where N >= 3) nodes Commit happens when a quorum has written the data CockroachDB/etcd/Spanner/Aurora/ZooKeeper/...
Consensus Replication Raft is our consensus protocol of choice. Run one consensus group per range of data
Consensus Replication (diagram walkthrough)
A client sends Put cherry to the Leader of a 3-replica range.
The Leader writes cherry locally and forwards Put cherry to both Followers.
The write is committed when 2 out of 3 nodes have written the data.
The Followers acknowledge, and the remaining Follower catches up with cherry.
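Boiled down to a hedged Go sketch, the sequence above is quorum counting: the leader's own durable write plus enough follower acknowledgements make a majority. Real Raft also tracks terms, log indexes, and leader election, all omitted here.

// Simplified quorum write; not the real Raft implementation.
func replicate(entry string, followers []chan<- string, acks <-chan bool) bool {
	votes := 1 // the leader's own durable write counts as one vote
	for _, f := range followers {
		f <- entry // send the entry to every follower
	}
	needed := (1+len(followers))/2 + 1 // majority of all replicas
	for range followers {
		if <-acks {
			votes++
		}
		if votes >= needed {
			return true // committed: a quorum has written the entry
		}
	}
	return false // no quorum; the write is not committed
}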
Consensus Replication What happens during failover?
Consensus Replication: Failover (diagram walkthrough)
The Leader accepts Put cherry but fails before the write reaches a quorum.
The surviving Followers elect a new Leader.
On failover, only data written to a quorum is considered present.
A subsequent Read cherry therefore returns: key not found.
Consensus Replication Consensus provides atomic writes But only for each range What about operations that hit multiple ranges?
Distributed Transactions SQL Transactional KV Distribution Replication Storage
Transactions Supports ACID semantics All-or-nothing Serializable [default] & Snapshot isolation levels
Distributed Transactions Transaction record Keyed by a transaction UUID Located on the first range written Atomic commit or abort via a write to the txn record One-phase commit fast path
Distributed Transactions (diagram walkthrough)
1. Begin Txn 1, Put cherry: a Txn-1 record is created with status PENDING, and cherry is written as an intent (txn 1) on its range.
2. Put mango: a second intent (txn 1) is written on a different range.
3. Commit Txn 1: the Txn-1 record's status flips to COMMITTED.
4. Clean up intents: the intents are rewritten as ordinary values.
5. Remove Txn 1: the transaction record is deleted.
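In sketch form (illustrative Go types, not CockroachDB's internals), the walkthrough above amounts to: provisional writes are intents tagged with a transaction ID, commit is a single status flip on the transaction record, and cleanup later rewrites intents as ordinary values.

// Illustrative only.
type TxnStatus string

const (
	Pending   TxnStatus = "PENDING"
	Committed TxnStatus = "COMMITTED"
	Aborted   TxnStatus = "ABORTED"
)

// The transaction record is keyed by the transaction's UUID and lives on
// the first range the transaction writes to.
type TxnRecord struct {
	ID     string
	Status TxnStatus
}

// A write intent is a provisional value tagged with its transaction's ID.
type Intent struct {
	Key, Value, TxnID string
}

// Committing is one write: flip the record's status. Every intent the
// transaction wrote, on any range, becomes logically committed at once.
func commitTxn(rec *TxnRecord) { rec.Status = Committed }

// Cleanup rewrites each intent as an ordinary value; the transaction
// record can then be removed.
func cleanup(kv map[string]string, intents []Intent) {
	for _, in := range intents {
		kv[in.Key] = in.Value
	}
}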
Distributed Transactions That's the happy case What about conflicting transactions?
Distributed Transactions (read conflict, diagram walkthrough)
Txn 1 has written cherry as an intent; its record is PENDING.
Txn 2 begins and issues Get cherry, finds the intent, and checks the Txn-1 record.
Txn 2 waits while the Txn-1 record is PENDING.
When Txn 1 commits, its record becomes COMMITTED; Txn 2 then reads cherry and commits.
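A hedged sketch of that read path: a reader that finds an intent consults the writer's transaction record, waits while it is pending, and then either sees the committed value or ignores the aborted intent. The function parameters are illustrative, not CockroachDB's API.

// Illustrative read of a key that may be covered by a write intent.
// txnStatus reports the intent owner's record: "PENDING", "COMMITTED", or "ABORTED".
func readKey(committedValue, intentValue string, hasIntent bool,
	txnStatus func() string, wait func()) string {
	if !hasIntent {
		return committedValue // no conflict
	}
	for txnStatus() == "PENDING" {
		wait() // block until the writing transaction resolves
	}
	if txnStatus() == "COMMITTED" {
		return intentValue // the writer won; its value is now visible
	}
	return committedValue // the writer aborted; its intent is ignored
}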
Distributed Transactions What about write / write conflicts?
Distributed Transactions (write conflict, diagram walkthrough)
Txn 1 has written cherry as an intent; its record is PENDING.
Txn 2 begins and issues its own Put cherry, creating a Txn-2 record (PENDING).
Txn 2 finds Txn 1's intent, checks the Txn-1 record, and waits while it is PENDING.
Once Txn 1 commits, Txn 2 resolves the committed cherry (txn 1) intent and writes its own intent, cherry (txn 2).
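Some conflicts force a transaction to restart; CockroachDB reports these to the client as retryable errors (SQLSTATE 40001 over the PostgreSQL wire protocol), so applications typically rerun the whole transaction. A hedged sketch with the lib/pq driver:

import "github.com/lib/pq"

// retryTxn reruns fn while CockroachDB reports a retryable transaction
// error (code 40001). Sketch only: real code would add backoff and a context.
func retryTxn(fn func() error) error {
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if pqErr, ok := err.(*pq.Error); ok && pqErr.Code == "40001" {
			continue // serialization conflict: safe to rerun the transaction
		}
		return err // anything else is not retryable
	}
	return err
}

For example, retryTxn(func() error { return transfer(db, "alice", "bob", 10) }) reruns the earlier hypothetical transfer; the cockroach-go helper library offers a similar wrapper (crdb.ExecuteTx).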