
How Tencent uses PGXZ (PGXC) in the WeChat payment system. Jason (李跃森), jasonysli@tencent.com

About Tencent and WeChat Payment
- Tencent: one of the biggest internet companies in China.
- WeChat: the most popular social network app in China.
- WeChat Payment: a wireless, fast, secure, and efficient payment solution provided by WeChat.

Why a PGXZ (PGXC) database cluster?
- Data size is HUGE (30 TB per year and increasing).
- SQL is more flexible and popular (compared to key-value and other solutions).
- Complex queries (joins across multiple datanodes).
- Distributed transactions (easy to use).
- High transaction throughput (peak 1.5 million per minute).
- Capability of horizontal scale-out (read and write workload).

PGXZ Architecture: 5 aspects
- App interface: SQL
- Business model: OLTP
- Complex query: joins over multiple nodes
- Distributed transactions
- MPP: shared nothing
[Architecture diagram: App, Coordinators, GTM, and datanodes, each a master paired with a slave]

PGXZ Architecture:
Coordinator:
1. Access entrance of the cluster for the app
2. Shards the data
3. Routes shards
4. Processes tuples of transactions spanning multiple nodes
5. Holds only the catalog, no user data
Datanode:
1. Stores application data
2. Executes DML against the data residing on this node
3. Replication
GTM:
1. Manages transaction traffic
Schema:
1. One copy on every node
2. Every coordinator has the same global catalog info
[Diagram: App, Coordinators, GTM, datanode masters and their slaves]

Online add/remove node: sharding
PGXC:
  hash_value = Hash(row)
  nodeid = hash_value % #nodes
  Must rehash all data of the whole cluster when adding/removing a datanode.
PGXZ (Logical Shard Map):
  shardid = Hash(row)
  nodeid = map(shardid)
  Routing is decoupled from the number of datanodes by adding the Logical Shard Map.
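A minimal Python sketch (not PGXZ source code) contrasting the two routing schemes; the hash function, SHARD_COUNT, and the shard-map contents are illustrative assumptions:

```python
# Sketch of the two routing schemes compared above.
import zlib

SHARD_COUNT = 4096                      # number of logical shards (assumed)

def pgxc_route(row_key: str, node_count: int) -> int:
    """PGXC-style routing: the node id depends directly on the node count,
    so adding/removing a datanode forces a full rehash of the cluster."""
    return zlib.crc32(row_key.encode()) % node_count

# Logical shard map: shard id -> datanode id.  Only the entries that are
# moved have to change when a node is added; rows in other shards stay put.
shard_map = {shard: shard % 3 for shard in range(SHARD_COUNT)}   # 3 datanodes

def pgxz_route(row_key: str) -> int:
    """PGXZ-style routing: shardid = hash(row), nodeid = map(shardid)."""
    shard_id = zlib.crc32(row_key.encode()) % SHARD_COUNT
    return shard_map[shard_id]

# Scale out: hand a subset of shards to a new datanode 3 by editing the map.
for shard in range(0, SHARD_COUNT, 4):
    shard_map[shard] = 3
```

Because routing goes through the map, scaling out only rewrites the map entries of the shards being moved; rows in all other shards never need rehashing.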

Online scale out: steps
[Diagram: Coordinator, Monitor, existing datanodes plus one new datanode, each with a slave; the Monitor drives moving tasks task1, task2, ...]
Monitor:
1. Adds the new datanode to the cluster.
2. Makes a PLAN for moving data to the new datanode:
   - A PLAN consists of several data-moving TASKS.
   - One data-moving TASK can move several shards.
   - One data-moving TASK is an atomic operation.

Online add/remove node: moving data
Dump stock data from the source datanode:
1. Start dumping the stock data with a snapshot.
2. Start logical decoding with the same snapshot.
   This ensures there is no overlap or gap between the stock data and the incremental data.
Restore the stock data to the target datanode.
Get incremental data from the source datanode:
1. Fetch the incremental data via logical decoding, ensuring there is no overlap or gap between two iterations of logical decoding.
Restore the incremental data to the target datanode:
1. Restore the incremental data to the target datanode.
2. Go to the next iteration; if the rows written in this iteration fall below a threshold, move on to updating the routing info.
Update routing info:
1. Block writes to the data being moved (shard/partition).
2. Update the routing info on all coordinators.
3. Release the block (20 ms ~ 30 ms).
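A control-flow sketch of this data-moving loop; the helper methods on source, target, and coordinator objects (create_snapshot_and_slot, decode_changes, apply, block_writes, update_shard_map, ...) are hypothetical stand-ins for PGXZ internals, not real APIs:

```python
# Sketch of the move loop: consistent snapshot + logical decoding, then a
# short write block while the routing info is switched over.
THRESHOLD = 1000        # assumed cut-over threshold (rows per iteration)

def move_shards(source, target, shards, coordinators):
    # 1. Dump stock data and open logical decoding on the SAME snapshot,
    #    so stock data and incremental changes neither overlap nor leave gaps.
    snapshot, slot = source.create_snapshot_and_slot(shards)
    target.restore(source.dump_with_snapshot(shards, snapshot))

    # 2. Iterate: apply incremental changes until one iteration is small enough.
    while True:
        changes = source.decode_changes(slot)          # one decoding iteration
        rows_written = target.apply(changes)
        if rows_written < THRESHOLD:
            break

    # 3. Cut over: block writes briefly while routing is updated (~20-30 ms).
    for cn in coordinators:
        cn.block_writes(shards)
    target.apply(source.decode_changes(slot))          # drain the last changes
    for cn in coordinators:
        cn.update_shard_map(shards, target.node_id)
        cn.unblock_writes(shards)
```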

Multi-groups: handling huge merchants (skew)
[Diagram: Apps, CNs, GTM, a Small Key Group and a Big Key Group of datanodes]
Small Key Group:
- All data belonging to one merchant resides on a single datanode.
- No distributed transaction when writing for one small merchant.
- No join across multiple datanodes when querying one small merchant's data.
Big Key Group:
- One merchant's data is scattered across all datanodes inside the group.
- A merchant's data generated on the same day resides on the same datanode.
- No matter how big the merchant is, it can be stored in this cluster.

Multi-groups: data stored separately by access frequency (cost saving)
[Diagram: the Coordinator routes monthly partitions (2016/01 ... 2016/06) to two groups]
Hot data group: high-frequency data on high-performance hardware.
Cold data group: low-frequency data on high-volume hardware.

Multi-groups: routing to 4 groups
The coordinator routes each row in three steps:
1. Hot/Cold: decided by create_time.
2. Small/Big: decided by whether merchant_id is in the Big Key List.
3. Shard routing: the shardid is looked up in the shard map of the chosen group.
The four target groups are: Small Key Group (hot), Big Key Group (hot), Small Key Group (cold), Big Key Group (cold).
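A minimal sketch of this three-step routing decision; the hot/cold boundary, the Big Key List contents, SHARD_COUNT, and the shard-map contents are all made-up placeholders:

```python
# Toy four-group router: hot/cold by create_time, small/big by merchant_id,
# then a shard-map lookup inside the chosen group.
import zlib
from datetime import datetime

HOT_SINCE = datetime(2016, 1, 1)            # assumed hot/cold boundary
BIG_KEY_LIST = {7777777}                    # assumed huge-merchant ids
SHARD_COUNT = 4096
# One shard map per group: shardid -> datanode id (contents are placeholders).
shard_maps = {
    ("hot",  "small"): {s: s % 17 for s in range(SHARD_COUNT)},
    ("hot",  "big"):   {s: s % 8  for s in range(SHARD_COUNT)},
    ("cold", "small"): {s: s % 3  for s in range(SHARD_COUNT)},
    ("cold", "big"):   {s: s % 3  for s in range(SHARD_COUNT)},
}

def route(merchant_id: int, create_time: datetime, shard_key: str) -> int:
    temperature = "hot" if create_time >= HOT_SINCE else "cold"   # step 1
    size = "big" if merchant_id in BIG_KEY_LIST else "small"      # step 2
    shard_id = zlib.crc32(shard_key.encode()) % SHARD_COUNT       # step 3
    return shard_maps[(temperature, size)][shard_id]

print(route(7777777, datetime(2016, 6, 1), "order-123"))
```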

Plan optimization: order query for a huge merchant (90 million rows, huge merchants group, monthly partitions M+0 ... M+N)
SQL:
  select * from t_trade_ref
  where ftotal_amount between 1 and 683771
    and fcreate_time >= '2015-07-31 00:00:00'
    and fcreate_time < '2015-08-30 00:00:00'
    and fmerchant_id = 7777777
  order by ffinish_time desc limit 10 offset x;
- The table is partitioned by month on the row's creation time.
- A multi-column index is created on (fmerchant_id, ffinish_time); ffinish_time serves the ORDER BY.

Plan optimization: plan detail
[Plan diagram: on the coordinator, a Merge Append with ORDER BY ffinish_time DESC LIMIT 10 OFFSET x; on each datanode of the huge merchants group, a Merge Append over index scans of partitions M+0 ... M+N with the same ORDER BY / LIMIT pushed down.]
- The CN pushes down LIMIT/OFFSET.
- Each partition is read with an index scan.
- Each datanode combines its partitions with Merge Append.
- The CN combines the results from the datanodes with Merge Append.
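A toy Python illustration of the Merge Append plus pushed-down LIMIT idea (not the PGXZ executor): each partition or datanode delivers rows already sorted on ffinish_time descending, and the upper level merges lazily and stops after OFFSET + LIMIT rows:

```python
# Merge already-sorted streams and stop early, mirroring Merge Append with a
# pushed-down LIMIT/OFFSET.
import heapq
from itertools import islice

def merge_append(sorted_streams, limit, offset=0):
    """Merge streams that are each sorted DESC on ffinish_time, lazily."""
    merged = heapq.merge(*sorted_streams, key=lambda r: r["ffinish_time"],
                         reverse=True)
    return list(islice(merged, offset, offset + limit))

# Example: three "partitions", each already sorted DESC on ffinish_time.
p0 = [{"ffinish_time": t} for t in (9, 7, 4)]
p1 = [{"ffinish_time": t} for t in (8, 6, 1)]
p2 = [{"ffinish_time": t} for t in (5, 3, 2)]

# Datanode level merges partitions; the CN merges datanode results the same
# way, so only roughly LIMIT rows per node ever travel to the coordinator.
print(merge_append([iter(p0), iter(p1), iter(p2)], limit=4))
```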

Plan optimization-performance result 6000 5000 4000 TPS3000 2000 1000 160 140 120 100 80 60 40 20 0 0 1233 3587 QPS 5631 5652 5429 32 100 200 400 800 SESSION NUBER 23 26 latency (MS) 35 72 145 32 100 200 400 800 SESSION NUBER Finish query within less than 50ms. QPS can reach up to 5 thousand.

Load balance: use standby datanodes to provide read-only service
[Diagram: read/write coordinators point at the master datanodes (node1, node2); read-only coordinators point at their hot standbys (node1 standby, node2 standby).]
- A separate kind of coordinator is used: the catalog of the read-only coordinators contains entries for the hot-standby datanodes.
- The master datanodes ship logs to the standbys, which run in hot-standby mode.

Load balance: use the standby datanode to archive xlog
[Diagram: node1 standby archives xlog to HDFS.]
- The master datanode already carries a high workload.
- So the standby datanode does the archive job.

Online table reorganize: append-only heap mode
[Diagram: with append-only mode on, INSERTs bypass the free space map; stock tuples and new tuples are separated in the heap.]
- Tuples are appended to the end of the heap, so newly arriving tuples can be told apart from stock tuples.
- Append-only mode can be switched on and off.

Online table reorganize: shared CTID pipeline
[Diagram: regular backends write CTIDs into a shared pipeline; the concurrent VACUUM FULL process reads them.]
- A shared pipeline transfers tuple modifications to the concurrent VACUUM FULL process.
- Regular Postgres backends (DML) write to the shared pipeline.
- The VACUUM FULL process reads the CTID info from the pipeline.
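A toy producer/consumer sketch of the shared-pipeline idea, using a Python queue in place of the real shared-memory pipeline; the CTID values and the stop marker are illustrative only:

```python
# DML backends publish the CTIDs they modify; the concurrent "vacuum full"
# worker consumes them so it can re-apply those changes to the new heap.
import queue
import threading

ctid_pipeline: "queue.Queue[tuple[int, int]]" = queue.Queue()

def dml_backend(rows):
    """Stand-in for a regular backend: after each write, push the tuple's
    CTID (block number, offset) into the shared pipeline."""
    for ctid in rows:
        ctid_pipeline.put(ctid)

def vacuum_full_worker(stop_marker=(-1, -1)):
    """Stand-in for the concurrent VACUUM FULL process: drain CTIDs and
    apply the corresponding changes to the new heap."""
    while True:
        ctid = ctid_pipeline.get()
        if ctid == stop_marker:
            break
        # ... re-read the tuple at `ctid` and copy it into the new heap ...

t = threading.Thread(target=vacuum_full_worker)
t.start()
dml_backend([(10, 3), (10, 4), (11, 1)])
ctid_pipeline.put((-1, -1))      # end-of-demo signal
t.join()
```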

Online table reorganize: detailed steps
1. BEGIN.
2. Switch on append-only heap mode.
3. Build the new heap: copy the stock data item by item.
4. Build the indexes of the new heap.
5. Apply new changes; repeat until the remaining change set is small enough.
6. BLOCK DML and apply the last changes.
7. Swap the table relfilenode and the index relfilenodes.
8. Drop the old table and indexes.
9. END.
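A control-flow sketch of these steps, assuming the loop and cut-over structure described above; every method on `table` and `new_heap` is a hypothetical placeholder for PGXZ internals:

```python
# Concurrent "vacuum full": copy stock data, catch up with changes, then
# finish the swap under a short DML block.
CHANGE_THRESHOLD = 1000     # assumed "small enough" cut-off

def vacuum_full_concurrently(table):
    table.switch_append_only(on=True)        # new tuples go to the heap tail
    new_heap = table.copy_stock_data()       # copy stock data item by item
    new_heap.build_indexes()

    # Catch up with concurrent changes (delivered via the CTID pipeline)
    # until one round is small enough to finish under a short DML block.
    while True:
        changes = table.collect_new_changes()
        new_heap.apply(changes)
        if len(changes) < CHANGE_THRESHOLD:
            break

    table.block_dml()
    new_heap.apply(table.collect_new_changes())   # apply the last changes
    table.swap_relfilenode(new_heap)              # table and index files
    table.drop_old_storage()
    table.switch_append_only(on=False)
    table.unblock_dml()
```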

High availability: disaster redundancy across cities and IDCs
[Deployment diagram: Shenzhen and Shanghai; three IDCs (IDC0, IDC1, IDC2), each with a ZKServer (ZKServer0-2), a JCenter (JCenter1 master, JCenter0 and JCenter2 slaves) and a CAgent (CAgent0-2); GTM, CNs, datanode masters M0/M1 and their slaves S0_1, S0_2, S1_1, S1_2 spread across the IDCs.]

PGXZ @ WeChat Payment:
- 2 GTMs (master-slave)
- 7 coordinators
- 31 pairs of datanodes:
  - 17 pairs for the small key group (hot)
  - 8 pairs for the huge key group (hot)
  - 3 pairs for the small key group (cold)
  - 3 pairs for the huge key group (cold)

Next steps
- Upgrade PGXZ (based on PG 9.3.5) to 9.6.x, or possibly rebase on Postgres-XL.
- Try to contribute the "vacuum full concurrently" feature.

My team
