
How we build TiDB Max Liu PingCAP Amsterdam, Netherlands October 5, 2016

About me Infrastructure engineer / CEO of PingCAP Working on open source projects: TiDB: https://github.com/pingcap/tidb TiKV: https://github.com/pingcap/tikv Email: liuqi@pingcap.com 2

Agenda Why another database? What to build? How to design? - The principles or the philosophy - The logical architecture - The alternatives How to develop? - The architecture - TiKV core technologies - TiDB core technologies How to test? 3

Why another database? RDBMS NoSQL NewSQL: F1 and Spanner 4

What to build? SQL Scalability ACID Transaction High Availability A Distributed, Consistent, Scalable, SQL Database. 5

How to design

How to design The design principles The logical architecture The alternatives 7

The design principles User-oriented - Disaster recovery - Easy to use - The community and ecosystem Loose coupling The alternatives [Diagram: Applications → MySQL Drivers (e.g. JDBC) → MySQL Protocol → SQL Layer (TiDB) → RPC → Key-Value Layer (TiKV)] 8

Disaster recovery Make sure we never lose any data! Multiple replicas alone are not enough We still need to keep a Binlog in both the SQL layer and the Key-Value layer Make sure we have a backup in case the entire cluster crashes 9

Easy to use No scary sharding keys No partitioning No explicit local index/global index Making scaling transparent 10

Cross-platform On-premise devices: like Raspberry Pi Container: Docker Cloud: public, private, hybrid 11

The community and ecosystem To engage, contribute and collaborate TiDB and MySQL Compatible with most MySQL drivers (ODBC, JDBC) and most MySQL syntax Compatible with MySQL clients and ORMs Compatible with MySQL management and benchmarking tools - mysqldump, mydumper, myloader - phpMyAdmin, Navicat, Workbench... - Sysbench 12

The community and ecosystem To engage, contribute and collaborate TiKV and etcd Share the Raft implementation Peer review the Raft module 13

The community and ecosystem To engage, contribute and collaborate TiKV and RocksDB 14

The community and ecosystem To engage, contribute and collaborate TiKV and Namazu https://github.com/pingcap/tikv/issues/846 15

The community and ecosystem To engage, contribute and collaborate The Rust community: Prometheus Thanks to the Rust team, gRPC and Prometheus! 16

The community and ecosystem To engage, contribute and collaborate Spark connector Why a Spark connector? TiDB is great for small or medium queries (such as a 10 million * 1 million filter); Spark is better for complex queries over large amounts of data. 17

Loose coupling - the logical architecture Highly layered Raft for consistency and scalability No distributed file system [Diagram: TiDB layer - MySQL Protocol Server, SQL Layer; TiKV layer - Transaction, MVCC, Raft, Local Key-Value Storage (RocksDB)] 18

The alternatives Atomic clocks or GPS clocks vs. Timestamp Allocator Distributed file system vs. RocksDB Paxos vs. Raft C++ vs. Go & Rust 19

Atomic clocks / GPS clocks VS Timestamp Allocator Time is important. Real time is vital in distributed systems. Can we get real time? Clock drift. 20

Real time We can't get real time precisely because of clock drift, whether it's GPS or atomic clocks 21

Atomic clocks / GPS clocks VS Timestamp Allocator Timestamp Allocator pros: Easy to implement; Doesn't depend on special hardware Cons: Latency is high across multiple DCs 22
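
To make the timestamp-allocator side of this trade-off concrete, here is a minimal sketch in Go of a centralized allocator that hands out strictly increasing hybrid timestamps (physical milliseconds plus a logical counter). The type and field names are illustrative only, not the actual Placement Driver implementation.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // TimestampOracle hands out strictly increasing timestamps.
    // Layout: physical milliseconds in the high bits, a logical counter
    // in the low bits.
    type TimestampOracle struct {
        mu       sync.Mutex
        physical int64 // last physical time in ms
        logical  int64 // counter within the same millisecond
    }

    const logicalBits = 18

    func (o *TimestampOracle) Next() int64 {
        o.mu.Lock()
        defer o.mu.Unlock()

        now := time.Now().UnixMilli()
        if now > o.physical {
            o.physical = now
            o.logical = 0
        } else {
            // Clock stayed the same or went backwards: keep the last
            // physical value and bump the logical counter instead.
            // (Overflow of the counter is ignored in this sketch.)
            o.logical++
        }
        return o.physical<<logicalBits | o.logical
    }

    func main() {
        var tso TimestampOracle
        a, b := tso.Next(), tso.Next()
        fmt.Println(a < b) // timestamps are strictly increasing: true
    }

A single allocator like this needs no special hardware, which is why it is easy to implement; the cost is that every transaction in a remote datacenter pays a round trip to it, which is exactly the multi-DC latency mentioned above.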

Distributed file system VS RocksDB RocksDB pros: Simple; Fast; Easy to tune Cons: Not easy to make it work properly with k8s 23

Paxos VS Raft Raft pros: Easy to understand and implement; Widely used and well tested Cons: Almost none 24

C++ VS Go & Rust Go & Rust pros: Go for fast development and concurrency; Rust for quality and performance Cons: Fewer third-party libraries 25

How to develop

How to develop The architecture The core technologies TiKV written in Rust TiDB written in Go 27

The architecture [Architecture diagram: MySQL clients → Load Balancer (optional) → stateless TiDB Servers (MySQL Protocol, TiDB SQL Layer, KV API, DistSQL API) → pluggable storage engine, e.g. TiKV (KV API, Coprocessor, Transaction, MVCC, Raft KV, RocksDB), with the Placement Driver (backed by etcd) managing the cluster] 28

TiKV core technologies

TiKV core technologies A distributed Key-Value layer to store data TiKV software stack Placement Driver Raft Scale and failover Multi-Version Concurrency Control (MVCC) Transaction 30

TiKV software stack [Diagram: clients talk to TiKV over RPC; four TiKV nodes (Store 1-4) each hold replicas of Regions 1-5; the replicas of one Region form a Raft group; a Placement Driver cluster (PD 1-3) oversees the stores] 31

Placement Driver The concept comes from Spanner Provides a God's view of the entire cluster Stores the metadata - clients cache the placement information Maintains the replication constraint - 3 replicas by default Moves data around to balance the workload It's a cluster too, of course - thanks to Raft [Diagram: three Placement Driver nodes forming their own Raft group] 32
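
As a rough illustration of "clients cache the placement information", the sketch below (Go, hypothetical types; not the real PD client API) looks a key up in a client-side region cache; on a miss the client would ask the Placement Driver and refill the cache.

    package main

    import "fmt"

    // Region describes a contiguous key range and where its leader lives.
    // (Hypothetical types for illustration; not the real PD API.)
    type Region struct {
        StartKey, EndKey string // covers [StartKey, EndKey)
        LeaderStore      uint64
    }

    // RegionCache is the client-side cache of placement information.
    type RegionCache struct {
        regions []Region // kept sorted by StartKey
    }

    // Locate returns the cached region covering key, or false on a miss,
    // in which case the client would ask the Placement Driver.
    func (c *RegionCache) Locate(key string) (Region, bool) {
        for _, r := range c.regions {
            if key >= r.StartKey && (r.EndKey == "" || key < r.EndKey) {
                return r, true
            }
        }
        return Region{}, false
    }

    func main() {
        cache := RegionCache{regions: []Region{
            {StartKey: "", EndKey: "m", LeaderStore: 1},
            {StartKey: "m", EndKey: "", LeaderStore: 2},
        }}
        if r, ok := cache.Locate("user/1"); ok {
            fmt.Println("send request to store", r.LeaderStore)
        }
    }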

Raft For scaling out and failover/replication Data is organized into Regions The replicas of one Region form a Raft group Multi-Raft: the workload is distributed among multiple Regions There can be millions of Regions in one big cluster Once a Region grows too large, it is split into two smaller Regions - just like cell division 33
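
The "cell division" can be sketched as a simple size check: once a Region's data grows past a threshold, it is split at a middle key into two Regions that each cover half of the original range. The threshold and types below are illustrative only.

    package main

    import "fmt"

    // A Region covers the key range [StartKey, EndKey) and tracks its
    // approximate size. (Simplified for illustration.)
    type Region struct {
        StartKey, EndKey string
        SizeBytes        int64
    }

    const maxRegionSize = 96 << 20 // e.g. 96 MB before a split is triggered

    // maybeSplit splits r at splitKey once it grows past the threshold,
    // returning the two halves -- the "cell division" from the slide.
    func maybeSplit(r Region, splitKey string) []Region {
        if r.SizeBytes < maxRegionSize {
            return []Region{r}
        }
        left := Region{StartKey: r.StartKey, EndKey: splitKey, SizeBytes: r.SizeBytes / 2}
        right := Region{StartKey: splitKey, EndKey: r.EndKey, SizeBytes: r.SizeBytes / 2}
        return []Region{left, right}
    }

    func main() {
        r := Region{StartKey: "a", EndKey: "z", SizeBytes: 120 << 20}
        for _, part := range maybeSplit(r, "m") {
            fmt.Printf("[%s, %s) ~%d MB\n", part.StartKey, part.EndKey, part.SizeBytes>>20)
        }
    }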

Scale-out (initial state) [Diagram: Regions 1, 2 and 3 are replicated across Nodes A-D; the leader of Region 1 (marked *) is on Node A] 34

Scale-out (balancing) 1) Transfer the leadership of Region 1 from the replica on Node A to the replica on Node B [Diagram: an empty Node E has joined the cluster] 35

Scale-out (balancing) 2) Add a replica of Region 1 on Node E [Diagram: Node E now holds a new replica of Region 1] 36

Scale-out (balancing) 3) Remove the replica of Region 1 from Node A [Diagram: Node A no longer holds Region 1; the load is spread across Nodes B-E] 37
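
The three balancing steps above can be thought of as scheduling operators that the Placement Driver issues to the Raft groups. The sketch below uses hypothetical operator names to express the same transfer-leader / add-replica / remove-replica sequence.

    package main

    import "fmt"

    // A scheduling step the Placement Driver asks a Raft group to perform.
    // (Hypothetical operator names, for illustration only.)
    type Operator struct {
        Kind     string // "transfer-leader", "add-replica", "remove-replica"
        RegionID uint64
        From, To uint64 // store IDs; zero when not applicable
    }

    // balanceRegion reproduces the three-step rebalancing from the slides:
    // move Region 1's work off the loaded store A onto stores B and E.
    func balanceRegion(regionID, leaderStore, targetLeader, newStore uint64) []Operator {
        return []Operator{
            {Kind: "transfer-leader", RegionID: regionID, From: leaderStore, To: targetLeader},
            {Kind: "add-replica", RegionID: regionID, To: newStore},
            {Kind: "remove-replica", RegionID: regionID, From: leaderStore},
        }
    }

    func main() {
        // Region 1: leader on store A(=1), move toward store B(=2) and new store E(=5).
        for _, op := range balanceRegion(1, 1, 2, 5) {
            fmt.Printf("%+v\n", op)
        }
    }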

MVCC Each transaction sees a snapshot of the database as of the time the transaction began. Any changes made by this transaction will not be seen by other transactions until the transaction is committed. Data is tagged with versions - Key_version: value Lock-free snapshot reads 38
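
A minimal in-memory sketch of the key_version: value idea, assuming a simplified encoding (TiKV actually folds the version into the RocksDB key): each write appends a new version, and a snapshot read returns the newest version at or below the reader's timestamp, without taking any locks.

    package main

    import (
        "fmt"
        "sort"
    )

    // A simplified in-memory MVCC store: every write appends a new version
    // keyed by (key, commit timestamp); nothing is overwritten in place.
    type mvccStore struct {
        versions map[string][]version
    }

    type version struct {
        ts    uint64
        value string
    }

    func (s *mvccStore) Put(key, value string, ts uint64) {
        s.versions[key] = append(s.versions[key], version{ts, value})
    }

    // Get returns the newest value whose commit ts is <= the snapshot ts,
    // which is the lock-free snapshot read from the slide.
    func (s *mvccStore) Get(key string, snapshotTS uint64) (string, bool) {
        vs := s.versions[key]
        sort.Slice(vs, func(i, j int) bool { return vs[i].ts < vs[j].ts })
        for i := len(vs) - 1; i >= 0; i-- {
            if vs[i].ts <= snapshotTS {
                return vs[i].value, true
            }
        }
        return "", false
    }

    func main() {
        s := &mvccStore{versions: map[string][]version{}}
        s.Put("Bob", "$10", 5)
        s.Put("Bob", "$3", 8)
        v, _ := s.Get("Bob", 6) // a transaction that started before ts 8
        fmt.Println(v)          // still sees $10
    }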

Transaction API

    txn := store.begin() // start a transaction
    txn.set([]byte("key1"), []byte("value1"))
    txn.set([]byte("key2"), []byte("value2"))
    err := txn.commit() // commit the transaction
    if err != nil {
        txn.rollback()
    }

I want to write code like this! 39

Transaction model Inspired by Google Percolator: 2-phase commit with 3 column families - cf:lock: an uncommitted transaction is writing this cell; contains the location of / a pointer to the primary lock - cf:write: stores the commit timestamp of the data - cf:data: stores the data itself 40

Transaction model Bob wants to transfer $7 to Joe. Initial state - Bob: data {5: $10}, lock {}, write {6: data @ 5}; Joe: data {5: $2}, lock {}, write {6: data @ 5} 41

Transaction model Prewrite the primary row - Bob: data {7: $3, 5: $10}, lock {7: "I am Primary"}, write {6: data @ 5}; Joe unchanged 42

Transaction model Prewrite the secondary row - Joe: data {7: $9, 5: $2}, lock {7: Primary@Bob.bal}, write {6: data @ 5} 43

Transaction model Commit records are written at timestamp 8 - Bob: write {8: data @ 7, 6: data @ 5}; Joe: write {8: data @ 7, 6: data @ 5}; both rows still hold their locks from timestamp 7 44

Transaction model The primary lock on Bob is removed - the transaction is committed; Joe still carries the secondary lock {7: Primary@Bob.bal} 45

Transaction model The secondary lock on Joe is removed as well - final state: Bob: data {7: $3, 5: $10}, Joe: data {7: $9, 5: $2}, both committed via write {8: data @ 7} 46
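
Putting the Bob-to-Joe walkthrough into code, here is a heavily simplified sketch of the Percolator-style prewrite/commit flow over an in-memory store with the three column families. It is illustrative only; the real TiKV implementation also handles conflict resolution, lock cleanup and failure recovery.

    package main

    import (
        "errors"
        "fmt"
    )

    // One row's three column families from the slides, keyed by timestamp.
    type row struct {
        data  map[uint64]string // data:  start_ts -> value
        lock  map[uint64]string // lock:  start_ts -> "primary" or pointer to primary
        write map[uint64]uint64 // write: commit_ts -> start_ts of the committed data
    }

    type store map[string]*row

    func (s store) getRow(k string) *row {
        if s[k] == nil {
            s[k] = &row{map[uint64]string{}, map[uint64]string{}, map[uint64]uint64{}}
        }
        return s[k]
    }

    // prewrite writes the value and a lock at start_ts; it fails if another
    // transaction already holds a lock on the row (write-write conflict).
    func (s store) prewrite(key, value, lockInfo string, startTS uint64) error {
        r := s.getRow(key)
        if len(r.lock) > 0 {
            return errors.New("key is locked by another transaction")
        }
        r.data[startTS] = value
        r.lock[startTS] = lockInfo
        return nil
    }

    // commit writes the commit record and removes the lock; committing the
    // primary row is the atomic commit point of the whole transaction.
    func (s store) commit(key string, startTS, commitTS uint64) {
        r := s.getRow(key)
        r.write[commitTS] = startTS
        delete(r.lock, startTS)
    }

    func main() {
        s := store{}
        s.getRow("Bob").data[5], s.getRow("Joe").data[5] = "$10", "$2"

        const startTS, commitTS = 7, 8
        // Phase 1: prewrite primary (Bob) and secondary (Joe).
        _ = s.prewrite("Bob", "$3", "I am Primary", startTS)
        _ = s.prewrite("Joe", "$9", "Primary@Bob.bal", startTS)
        // Phase 2: commit the primary first, then the secondary.
        s.commit("Bob", startTS, commitTS)
        s.commit("Joe", startTS, commitTS)

        fmt.Println(s["Bob"].data[startTS], s["Joe"].data[startTS]) // $3 $9
    }

Committing the primary row is the atomic commit point: if the coordinator crashes afterwards, a reader that finds Joe's secondary lock can follow it to Bob, see the commit record there, and roll the secondary forward.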

TiDB core technologies

TiDB core technologies A protocol layer that is compatible with MySQL Mapping table data to Key-Value store Predicate push-down Online DDL changes 48

Mapping table data to Key-Value store INSERT INTO user VALUES (1, 'bob', 'huang@pingcap.com'); INSERT INTO user VALUES (2, 'tom', 'tom@pingcap.com'); Key -> Value: user/1 -> bob | huang@pingcap.com; user/2 -> tom | tom@pingcap.com 49
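
A sketch of that mapping, using readable string keys for illustration (the real TiDB encoding is binary, along the lines of t<tableID>_r<rowID> for the key and a packed row for the value):

    package main

    import (
        "fmt"
        "strings"
    )

    // encodeRow turns one row of the user table into a key-value pair,
    // mirroring the "user/1 -> bob, huang@pingcap.com" mapping on the slide.
    func encodeRow(table string, rowID int64, cols ...string) (key, value string) {
        key = fmt.Sprintf("%s/%d", table, rowID)
        value = strings.Join(cols, ",")
        return key, value
    }

    func main() {
        k1, v1 := encodeRow("user", 1, "bob", "huang@pingcap.com")
        k2, v2 := encodeRow("user", 2, "tom", "tom@pingcap.com")
        fmt.Println(k1, "->", v1)
        fmt.Println(k2, "->", v2)
    }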

Secondary index is necessary Global index All indexes in TiDB are transactional and fully consistent Stored as separate key-value pairs in TiKV Example: create an index for user name - Index: bob_1 -> user/1, tom_2 -> user/2; Data: user/1 -> bob | bob@pingcap.com, user/2 -> tom | tom@pingcap.com 50
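
Because index entries are just extra key-value pairs, "transactional and fully consistent" simply means the row entry and its index entry are written in the same transaction. A toy sketch (hypothetical in-memory txn, not the real client API):

    package main

    import "fmt"

    // A stand-in for the transaction API from the earlier slide; the point
    // is that the row entry and its index entry are written in one
    // transaction, which is what keeps TiDB's indexes fully consistent.
    type txn struct{ writes map[string]string }

    func (t *txn) set(k, v string) { t.writes[k] = v }

    func (t *txn) commit(db map[string]string) {
        for k, v := range t.writes { // applied atomically in the real system
            db[k] = v
        }
    }

    func main() {
        db := map[string]string{}
        t := &txn{writes: map[string]string{}}
        t.set("user/1", "bob,bob@pingcap.com") // row entry
        t.set("bob_1", "user/1")               // index entry pointing at the row
        t.commit(db)
        // An index lookup is then: read the index key, follow it to the row key.
        fmt.Println(db["bob_1"], "->", db[db["bob_1"]])
    }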

Predicate push-down The TiDB server pushes the predicate age > 20 AND age < 30 down to TiKV. TiDB knows that Regions 1/2/5 store the data of the person table. [Diagram: TiDB Server → Region 1 on TiKV node 1, Region 2 on TiKV node 2, Region 5 on TiKV node 3] 51
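
The idea of predicate push-down is that the filter runs next to the data, so only matching rows travel back to the TiDB server. A rough Go sketch, with made-up data standing in for the three Regions of the person table:

    package main

    import (
        "fmt"
        "sync"
    )

    type person struct {
        name string
        age  int
    }

    // filterRegion runs the pushed-down predicate on one region's data,
    // so only matching rows cross the network back to the TiDB server.
    func filterRegion(rows []person, pred func(person) bool) []person {
        var out []person
        for _, r := range rows {
            if pred(r) {
                out = append(out, r)
            }
        }
        return out
    }

    func main() {
        // Regions 1, 2 and 5 of the person table, as in the slide.
        regions := [][]person{
            {{"ann", 25}, {"bob", 40}},
            {{"joe", 22}},
            {{"max", 31}, {"sue", 28}},
        }
        pred := func(p person) bool { return p.age > 20 && p.age < 30 }

        var mu sync.Mutex
        var merged []person
        var wg sync.WaitGroup
        for _, region := range regions {
            wg.Add(1)
            go func(rows []person) { // each region is filtered in parallel
                defer wg.Done()
                part := filterRegion(rows, pred)
                mu.Lock()
                merged = append(merged, part...)
                mu.Unlock()
            }(region)
        }
        wg.Wait()
        fmt.Println(merged) // TiDB only merges the already-filtered rows
    }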

Schema changes Why is online schema change a must-have? Full data availability Minimal performance impact Example: add an index to a column 52

Something the same as Google F1 The main features of TiDB that impact schema changes are: - Distributed: an instance of TiDB consists of many individual TiDB servers - Relational schema: each TiDB server has a copy of a relational schema that describes tables, columns, indexes, and constraints. Any modification to the schema requires a distributed schema change to update all servers. - Shared data storage: all TiDB servers in all datacenters have access to all data stored in TiKV. There is no partitioning of data among TiDB servers. - No global membership: because TiDB servers are stateless, there is no need for TiDB to implement a global membership protocol. This means there is no reliable mechanism to determine the currently running TiDB servers, and explicit global synchronization is not possible. 53

Something different from Google F1 TiDB speaks MySQL protocol Statements inside of a single transaction cannot cross different TiDB servers http://static.googleusercontent.com/media/research.google.com/zh-cn//pubs/archive/41376.pdf 54

One more thing before schema change The big picture of SQL [Diagram: two MySQL CLIs connect to two TiDB nodes, both backed by the same TiKV cluster] 55

Overview of a TiDB instance during a schema change [Diagram: two TiDB servers run with different schema versions at the same time - one MySQL CLI hits a server on Schema 1 (S1), another hits a server on Schema 2 (S2) - both over the same TiKV cluster] 56

Schema change: Adding index Servers using different schema versions may corrupt the database if we are not careful. Consider a schema change from schema S1 to schema S2 that adds index I on table T. Assume two different servers, M1 and M2, execute the following sequence of operations: 1. Server M2, using schema S2, inserts a new row r into table T. Because S2 contains index I, server M2 also adds a new index entry corresponding to r to the key-value store. 2. Server M1, using schema S1, deletes r. Because S1 does not contain I, M1 removes r from the key-value store but fails to remove the corresponding index entry in I, leaving a dangling index entry behind. 57

Schema states There are two states which we consider to be non-intermediate: absent and public There are two internal, intermediate states: delete-only and write-only Delete-only: a delete-only table, column, or index cannot have its key-value pairs read by user transactions, and 1. if E is a table or column, it can be modified only by delete operations; 2. if E is an index, it is modified only by delete and update operations. Moreover, the update operations can delete key-value pairs corresponding to updated index keys, but they cannot create any new ones. 58

Schema states There are two internal, intermediate states: delete-only and write-only Write-only: the write-only state is defined for columns and indexes as follows: a write-only column or index can have its key-value pairs modified by insert, delete, and update operations, but none of its pairs can be read by user transactions. 59

Schema change flow: Add index Delete-only -> Write-only -> Fill index (reorganization: MR job) -> Public 60
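
A compact sketch of that flow as a state machine. It only shows the order of the states; the real DDL owner also waits for every TiDB server to acknowledge each state (so any two servers are at most one state apart) before moving on.

    package main

    import "fmt"

    // The schema states an index moves through during an online ADD INDEX,
    // in the order used by the F1/TiDB online schema change protocol.
    type schemaState int

    const (
        absent schemaState = iota
        deleteOnly
        writeOnly
        public
    )

    func (s schemaState) String() string {
        return [...]string{"absent", "delete-only", "write-only", "public"}[s]
    }

    // addIndex walks the index through the intermediate states. backfill is
    // the reorganization step that builds index entries for existing rows
    // before the index becomes public. (Sketch only.)
    func addIndex(backfill func()) {
        for state := absent; state <= public; state++ {
            if state == public {
                backfill() // fill the index before exposing it to reads
            }
            fmt.Println("index state ->", state)
        }
    }

    func main() {
        addIndex(func() { fmt.Println("reorganization: backfilling index entries...") })
    }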

TiDB: status of Adding index (delete-only) 61

TiDB: status of Adding index (add index) 62

How to test

How to test Test cases from the community Lots of test cases in MySQL drivers/connectors Lots of ORMs Lots of applications Fault injection Hardware: disk errors, network card, CPU, clock Software: file system, network and protocol Simulated network: https://github.com/pingcap/tikv/blob/master/tests/raftstore/transport_simulate.rs Distributed testing Jepsen Namazu 64
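
The linked transport_simulate.rs wraps TiKV's Raft message transport with filters that can drop, delay or reorder messages during tests. A Go analogue of the same idea, with hypothetical interfaces, looks roughly like this:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Transport delivers Raft messages between nodes. (Hypothetical
    // interface, a Go analogue of TiKV's Rust test helper.)
    type Transport interface {
        Send(to uint64, msg string) error
    }

    type realTransport struct{}

    func (realTransport) Send(to uint64, msg string) error {
        fmt.Printf("delivered to node %d: %s\n", to, msg)
        return nil
    }

    // dropTransport wraps another Transport and randomly drops messages,
    // simulating an unreliable network.
    type dropTransport struct {
        inner    Transport
        dropRate float64
        rng      *rand.Rand
    }

    func (d dropTransport) Send(to uint64, msg string) error {
        if d.rng.Float64() < d.dropRate {
            return nil // silently dropped: the Raft layer must still converge
        }
        return d.inner.Send(to, msg)
    }

    func main() {
        t := dropTransport{inner: realTransport{}, dropRate: 0.3, rng: rand.New(rand.NewSource(1))}
        for i := 0; i < 5; i++ {
            _ = t.Send(2, fmt.Sprintf("AppendEntries #%d", i))
        }
    }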

The future GPS and atomic clocks Better query optimizer Better compatibility with MySQL Support for the JSON and document types Push down more aggregations and built-in functions Use gRPC instead of a customized RPC implementation 65

Here we are! TiDB: https://github.com/pingcap/tidb TiKV: https://github.com/pingcap/tikv Email: liuqi@pingcap.com 66

Thank you! Questions?

Rate My Session! 68