Architecture of a Real-Time Operational DBMS

Srini V. Srinivasan, Founder and Chief Development Officer, Aerospike
CMG India Keynote, Thane, December 3, 2016
[ CMGI Keynote, Thane, India. 2016 Aerospike Inc. All rights reserved. ]

Real-Time Workloads

Reliability at Massive Scale

Developments
- Internet growth: high transaction rates (millions per second)
- Storage advances: expansion of DRAM, rise of SSDs
- New distributed systems results and consensus algorithms (e.g., the CAP theorem, Paxos)
- Application developers prefer not to use SQL (Python, Go, PHP, ...)

Traditional DB
- Guarantees strong consistency for replicated data
- Limited in scalability and availability
- Cannot handle network partitions

NoSQL DB
- Focuses on massive scalability and high availability
- Uses concepts from operating systems and distributed systems

SQL → NoSQL

SQL databases are architected for:
- Disk-oriented storage and indexing structures
- Multithreading to hide latency
- Locking-based concurrency control mechanisms
- Log-based recovery

NoSQL or NewSQL databases are architected for:
- In-memory operation
- Incremental upgrades (no fork-lift upgrade)
- High availability
- Self-management (self-healing, self-maintaining, self-tuning)

Next Generation of Database Systems: Speed at Scale

[Figure: charts of speed (TPS) versus scale (TB) and TCO ($) versus scale (TB). Labels: Many Choices; Scaling up affordably; Restricted Functionality; Real-time Problem Set; Rich functionality, non-real-time use cases; Observed TCO; Desired TCO.]

TCO: total cost of ownership.

Goal: deliver predictable performance, highest availability, and lowest TCO.

Use Cases

Billion Dollar Advertising Market Uses RTB (Real-Time Bidding)
- 1 to 6 billion cookies tracked
- Auctions at about 3.0M/sec in North America
- 100 ms ad rendering, 50 ms real-time bidding
- Requirements: low latency, high throughput, high uptime

RTB Tech Stack

Beyond Ad Tech
- FinServ
- Marketing Tech
- Telco
- AdTech
- Gaming

Financial Services Tech Stack

Fraud Detection Tech Stack

Telco Tech Stack

Operational Scale in Enterprises

[Figure: business transactions (web views, payments, mobile queries, recommendations, and more) feed a decisioning engine backed by a high-performance NoSQL store for real-time big data decisioning. XDR connects it to a legacy RDBMS (mainframe) and an HDFS-based data warehouse / data lake.]

500 business transactions per second x 5000 calculations = 2.5M database transactions per second.

Technology

Architecture Overview
1) No Hotspots: distributed hashing simplifies data partitioning
2) Smart Client: 1 hop to data, load balancing
3) Shared Nothing Architecture: every node is identical
4) Smart Clustering: auto-sharding, auto-failover, auto-rebalancing, rack aware, rolling upgrades
5) Transactions and long-running tasks prioritized in real time
6) XDR (cross-datacenter replication): sync replication across data centers ensures near-zero downtime

Cluster Formation

Say N1 is the seed node and N3 is the Paxos principal.
- N2 and N3 send themselves in a list to N1; N1 discovers them.
- N1 sends the adjacency list [N1, N2, N3] to the newly discovered node N3 (and also to N2).
- N3 discovers N2 and starts sending the cluster node list [N3, N2, N1] to N1 and N2.

Distributed Hash-Based Partitioning

Distributed hashing with no hotspots:
- Every key is hashed with RIPEMD160 into an ultra-efficient 20-byte (fixed-length) string
- Hash + additional data (fixed 64 bytes) forms the index entry in RAM
- Some bits from the hash value are used to calculate the Partition ID (4096 partitions)
- Partition ID maps to Node ID in the cluster
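The key-to-node mapping on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the database's actual code: SHA-1 stands in for RIPEMD160 (it is always available in Python's hashlib and also yields a 20-byte digest), the slide does not say which bits of the hash form the Partition ID (the low 12 bits of the first two bytes are used here), and the partition map passed to node_for_key is a hypothetical table.

```python
import hashlib

N_PARTITIONS = 4096  # from the slide; 4096 == 2**12, so 12 bits are enough


def key_digest(key: bytes) -> bytes:
    """Return a fixed-length 20-byte digest of the key.

    Assumption: SHA-1 stands in for RIPEMD160 because it is always
    available in hashlib and also produces 20 bytes.
    """
    return hashlib.sha1(key).digest()


def partition_id(digest: bytes) -> int:
    """Derive a partition ID from 12 bits of the digest.

    Assumption: the low 12 bits of the first two bytes (little-endian).
    """
    return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)


def node_for_key(key: bytes, partition_map: list[str]) -> str:
    """Look up the owning node in a partition-ID -> node-ID table."""
    return partition_map[partition_id(key_digest(key))]


if __name__ == "__main__":
    # Hypothetical 3-node cluster with partitions dealt out round-robin.
    pmap = [f"N{1 + p % 3}" for p in range(N_PARTITIONS)]
    print(node_for_key(b"user:42", pmap))
```

Because the partition ID depends only on the key's digest, any client holding the same partition map reaches the same node without asking a coordinator.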

Data Distribution

Index and data are colocated, in order to:
1. Distribute workload uniformly
2. Provide predictable read/write performance
3. Scale up and down by simply adding cluster nodes
4. Rebalance data non-disruptively and efficiently

Partition assignment objectives:
1. Deterministic, so each node can operate by itself
2. Uniform distribution of partitions across nodes
3. Minimize partition moves during cluster changes

Partition Assignment Algorithm

    function REPLICATION_LIST_ASSIGN(partitionid):
        node_hash = empty map
        for nodeid in succession_list:
            node_hash[nodeid] = NODE_HASH_COMPUTE(nodeid, partitionid)
        replication_list = sort_ascending(node_hash using hash)
        return replication_list

    function NODE_HASH_COMPUTE(nodeid, partitionid):
        nodeid_hash = fnv_1a_hash(nodeid)
        partition_hash = fnv_1a_hash(partitionid)
        return jenkins_one_at_a_time_hash(<nodeid_hash, partition_hash>)
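A runnable Python rendering of the pseudocode above may make its properties easier to check. It is a sketch under assumptions the slide leaves open: node and partition IDs are treated as integers, FNV-1a is the 32-bit variant applied to their 8-byte little-endian encoding, the two hashes are concatenated as 4-byte little-endian values before the Jenkins hash, and ties are broken by node ID.

```python
def fnv_1a_hash(value: int) -> int:
    """32-bit FNV-1a over the 8-byte little-endian encoding of value."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in value.to_bytes(8, "little"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, kept to 32 bits
    return h


def jenkins_one_at_a_time_hash(data: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, 32-bit."""
    h = 0
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def node_hash_compute(nodeid: int, partitionid: int) -> int:
    """Mix the node hash and partition hash into one sort key."""
    pair = (fnv_1a_hash(nodeid).to_bytes(4, "little")
            + fnv_1a_hash(partitionid).to_bytes(4, "little"))
    return jenkins_one_at_a_time_hash(pair)


def replication_list_assign(succession_list: list[int], partitionid: int) -> list[int]:
    """Order nodes for one partition: first entry is master, rest are replicas."""
    return sorted(succession_list,
                  key=lambda n: (node_hash_compute(n, partitionid), n))


if __name__ == "__main__":
    nodes = [101, 102, 103, 104]  # hypothetical node IDs
    print(replication_list_assign(nodes, 7))
```

Each node's sort key depends only on that node and the partition, so every node computes the same list independently, and removing a node leaves the relative order of the remaining nodes unchanged. Both match the stated objectives of determinism and minimal partition movement.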

Real-Time Prioritization

[Figure: master and replica nodes]

Writing with Immediate Consistency
1. Write sent to row master
2. Latch against simultaneous writes
3. Apply write to master and replica synchronously
4. Queue operations to disk
5. Signal completed transaction
6. Master merges duplicate copies (if any)

Adding a Node (transactions continue)
1. Cluster discovers new node via gossip protocol
2. Paxos vote determines new data organization
3. Partition migrations scheduled (only deltas copied)
4. When a partition migration starts, write journal starts on destination
5. Partition moves atomically
6. Journal is applied and source data deleted

Intelligent Client

The Aerospike Client is implemented as a library (JAR or DLL) and consists of 2 parts:
- Operation APIs: the operations that you can execute on the cluster (CRUD+, etc.)
- First-class observer of the cluster: monitors the state of each node and is aware of new nodes or node failures

1 hop to data:
- Smart Client simply calculates the Partition ID to determine the Node ID
- Client performs load balancing
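The two client roles described here, cluster observer and one-hop router, can be illustrated with a small sketch. The class and method names are hypothetical and are not the Aerospike client API. SHA-1 again stands in for RIPEMD160, and falling back to the next replica when the master is marked down is a simplification; a real client would more likely refresh its partition map from the cluster.

```python
import hashlib

N_PARTITIONS = 4096


class SketchClient:
    """Hypothetical client: tracks node state and routes a key in one hop."""

    def __init__(self, partition_map):
        # partition_map[pid] is a list of node IDs, master first.
        self.partition_map = [list(replicas) for replicas in partition_map]
        self.alive = {n for replicas in partition_map for n in replicas}

    def node_down(self, node):
        """Observer side: record that a node has failed."""
        self.alive.discard(node)

    def node_up(self, node):
        """Observer side: record that a node has joined or recovered."""
        self.alive.add(node)

    def _partition_id(self, key: bytes) -> int:
        digest = hashlib.sha1(key).digest()  # stand-in for RIPEMD160
        return int.from_bytes(digest[:2], "little") % N_PARTITIONS

    def route(self, key: bytes) -> str:
        """Return the first live node in the key's replica list."""
        for node in self.partition_map[self._partition_id(key)]:
            if node in self.alive:
                return node
        raise RuntimeError("no live replica for this partition")


if __name__ == "__main__":
    pmap = [["N1", "N2"] if p % 2 == 0 else ["N2", "N1"]
            for p in range(N_PARTITIONS)]
    client = SketchClient(pmap)
    print(client.route(b"user:42"))
```

The routing decision is a local table lookup, which is what lets the client reach the data in a single network hop.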

Designed for Wire-Line Speed

Multi-core architecture, optimized C-based DB kernel:
1. Multi-threaded data structures
2. Nested locking model for synchronization
3. Lockless data structures
4. Partitioned single-threaded data structures
5. Index entries are aligned to cache line (64 bytes)
6. Custom memory management (arenas)

[Figure: Memory Arena Assignment]

In-Memory and Hybrid-Memory Storage Architecture

Highlights
1. Direct device access
2. Large block writes
3. Indexes in DRAM
4. Highly parallelized
5. Log-structured FS, copy-on-write
6. Fast restart with shared memory

[Figure: Storage Layout]

Benchmarks

Hybrid-Memory Performance

[Figure: two panels. "High Throughput": throughput (ops/second) for the Balanced and Read-Heavy workloads. "Low Latency": Balanced Workload Read Latency, average latency (ms) versus throughput (ops/sec). Legend: Aerospike, Cassandra.]

Balanced: 50/50 read-write ratio. Read-Heavy: 95/5 read-write ratio.

In-Memory Performance: YCSB Benchmark
- 50 million records
- YCSB Workload A (50/50 R/W)
- YCSB Workload B (95/5 R/W)
- Zipfian key distribution
- 8-core dual-socket Intel Xeon CPU E5-2665 @ 2.4 GHz
- 32 GB DRAM with 16 queues

DRAM vs SSD on GCE
- GCE instance n1-standard-8
- 10-node cluster
- 150-byte record with 3 columns
- 100 million records

Predictable Performance During Failures

[Figure: performance timeline with five numbered phases]

Phases
1) 100K TPS, 4 nodes
2) Clients at max
3) 400K TPS, 4 nodes
4) 400K TPS, 3 nodes
5) 400K TPS, 4 nodes

Aerospike node specs: CentOS 6.3; Intel i5-2400 @ 3.1 GHz (quad core); 16 GB RAM @ 1333 MHz

TCO: In-Memory vs Hybrid-Memory

Actual deployment analysis. Deployment requires 500K TPS, 10 TB of storage, with 2x replication factor.

                                             In-Memory System          Hybrid-Memory System
Servers required                             186                       14
Storage per server                           180 GB (196 GB server)    2.4 TB (4 x 700 GB)
TPS per cluster                              500,000                   500,000
Cost per server                              $8,000                    $11,000
Server costs                                 $1,488,000                $154,000
Power per server                             0.9 kW                    1.1 kW
Power (2 years, $0.12 per kWh, US average)   $352,000                  $32,400
Maintenance (2 years, $3,600 per server)     $670,000                  $50,400
Total                                        $2,510,000                $236,800
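The totals in this comparison can be reproduced from the per-server figures. The short calculation below is a sketch that assumes two 365-day years of continuous operation and that the $3,600 maintenance figure covers the full two years per server. Under those assumptions it matches the slide's numbers to within rounding.

```python
HOURS_2Y = 2 * 365 * 24  # assumption: two non-leap years, servers always on


def tco(servers, cost_per_server, kw_per_server,
        price_per_kwh=0.12, maintenance_per_server=3600):
    """Return the 2-year cost breakdown in dollars for one deployment."""
    server_costs = servers * cost_per_server
    power = servers * kw_per_server * HOURS_2Y * price_per_kwh
    maintenance = servers * maintenance_per_server
    return {
        "server_costs": server_costs,
        "power": power,
        "maintenance": maintenance,
        "total": server_costs + power + maintenance,
    }


if __name__ == "__main__":
    in_memory = tco(servers=186, cost_per_server=8000, kw_per_server=0.9)
    hybrid = tco(servers=14, cost_per_server=11000, kw_per_server=1.1)
    for name, result in (("in-memory", in_memory), ("hybrid-memory", hybrid)):
        print(name, {k: round(v) for k, v in result.items()})
    print("cost ratio:", round(in_memory["total"] / hybrid["total"], 1))
```

Most of the gap comes from the server count: 186 machines against 14 multiplies every per-server cost line.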

Future Work

Software
- Linearize with CP mode
- Eventual consistency with conflict detection and resolution
- Pipelined execution of client transactions for increased performance
- Security enhancements

Application Requirements (customers demand)
- Real-time decisions based on recent data
- High consistency
- Security

New Hardware
- 3D XPoint
- High-core-count CPUs
- NVMe
- Multi-queue network cards
- Virtualized IO

Thank You

Questions?