Bringing code to the data: from MySQL to RocksDB for high volume searches

Bringing code to the data: from MySQL to RocksDB for high volume searches
Percona Live 2016, Santa Clara, CA
Ivan Kruglov, Senior Developer, ivan.kruglov@booking.com

Agenda: Problem domain, Evolution of search, Architecture, Results, Conclusion

Problem domain

Search at Booking.com
Input:
- Where: city, country, region
- When: check-in date
- How long: check-out date
- What: search options (stars, price range, etc.)
Result:
- Available hotels

Inventory vs. Availability
- Inventory is what hotels give Booking.com (hotel/room inventory)
- Availability = search + inventory: under which circumstances one can book this room and at what price
- Availability >>> Inventory

"[Booking.com] works with approximately 800,000 partners, offering an average of 3 room types, 2+ rates, 30 different length of stays across 365 arrival days, which yields something north of 52 billion price points at any given time."
http://www.forbes.com/sites/jonathansalembaskin/2015/09/24/booking-com-channels-its-inner-geek-towardengagement/#2dbc6f6326b2
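That figure checks out as a rough product (my own arithmetic, not from the talk): 800,000 partners × 3 room types × 2 rates × 30 lengths of stay × 365 arrival days ≈ 52.6 billion price points.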

Evolution of search

Normalized availability (pre-2011)
- classical LAMP stack, where P stands for Perl
- normalized availability: a write-optimized dataset
- each search request handled by a single worker
- too much computational complexity: large cities become unsearchable

Pre-computed availability (2011+)
- materialized == de-normalized, flattened dataset
- aim for constant-time fetch
- separate read-optimized (availability) and write-optimized (inventory) datasets
- still a single worker: as inventory grows, big searches remain a problem

Map-Reduced search (2014+)
- parallelized search: multiple workers, multiple MR phases
- search as a service: a distributed service, with all the good and bad sides
- world search takes ~20 s
- overheads: IPC, serialization

Don't Bring the Data to the Code, Bring the Code to the Data
L1 cache reference                      0.5 ns
Branch mispredict                       5 ns
L2 cache reference                      7 ns
Mutex lock/unlock                       25 ns
Main memory reference                   100 ns
Compress 1K bytes with Snappy           3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns       0.01 ms
Read 4K randomly from SSD               150,000 ns      0.15 ms
Read 1 MB sequentially from memory      250,000 ns      0.25 ms
Round trip within same datacenter       500,000 ns      0.5 ms
Read 1 MB sequentially from SSD*        1,000,000 ns    1 ms
Disk seek                               10,000,000 ns   10 ms
Read 1 MB sequentially from disk        20,000,000 ns   20 ms
Send packet CA->Netherlands->CA         150,000,000 ns  150 ms
https://gist.github.com/jboner/2841832

Map-Reduce + local AV (2015+)
- SmartAV (smart availability): MR search combined with a local database
- keep data in RAM
- change the stack to Java to reduce the constant factor: distance-to-point for 100K hotels takes 0.4 s in Perl vs. 0.04 s in Java
- use multithreading: smaller overheads than IPC
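To illustrate the constant-factor point, here is my own sketch of that distance-to-point workload in Java (not code from the talk); the Hotel record, random coordinates and the haversine formula are assumptions for the example:

import java.util.Random;

// Minimal sketch of the "distance to point for 100K hotels" workload.
public class DistanceScan {
    record Hotel(int id, double lat, double lon) {}

    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Hotel[] hotels = new Hotel[100_000];
        for (int i = 0; i < hotels.length; i++) {
            hotels[i] = new Hotel(i, rnd.nextDouble() * 180 - 90, rnd.nextDouble() * 360 - 180);
        }
        double searchLat = 41.9, searchLon = 12.5;   // roughly Rome
        long start = System.nanoTime();
        int nearest = -1;
        double best = Double.MAX_VALUE;
        for (Hotel h : hotels) {                     // the tight loop whose constant factor matters
            double d = haversineKm(searchLat, searchLon, h.lat(), h.lon());
            if (d < best) { best = d; nearest = h.id(); }
        }
        System.out.printf("nearest hotel %d at %.1f km, scan took %.1f ms%n",
                nearest, best, (System.nanoTime() - start) / 1e6);
    }
}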

Architecture

[Architecture diagram: materialization pipeline and search cluster]

[Diagram: the search cluster, organized into replicas and partitions]

Coordinator
- acts as a proxy, knows the cluster state
- queries a randomly chosen replica in each partition (scatter-gather)
- retries if necessary
- merges partial results into the final result
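A minimal scatter-gather sketch of what such a coordinator does (my own illustration; the PartitionClient interface, retry policy and merge step are assumptions, not Booking.com's code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical coordinator: fan the query out to one randomly chosen replica per
// partition, retry a failing partition on another replica, merge partial results.
public class Coordinator {
    interface PartitionClient {                                    // assumed interface
        List<Integer> search(String query) throws Exception;      // returns matching hotel ids
    }

    private final List<List<PartitionClient>> replicasByPartition; // [partition][replica]
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Random rnd = new Random();

    Coordinator(List<List<PartitionClient>> replicasByPartition) {
        this.replicasByPartition = replicasByPartition;
    }

    List<Integer> search(String query) throws Exception {
        // scatter: one task per partition
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<PartitionClient> replicas : replicasByPartition) {
            Callable<List<Integer>> task = () -> queryWithRetry(replicas, query);
            futures.add(pool.submit(task));
        }
        // gather: merge the partial results into the final result
        List<Integer> merged = new ArrayList<>();
        for (Future<List<Integer>> f : futures) {
            merged.addAll(f.get());
        }
        Collections.sort(merged);
        return merged;
    }

    private List<Integer> queryWithRetry(List<PartitionClient> replicas, String query) throws Exception {
        try {
            return replicas.get(rnd.nextInt(replicas.size())).search(query);
        } catch (Exception firstFailure) {
            // one retry on another randomly chosen replica of the same partition
            return replicas.get(rnd.nextInt(replicas.size())).search(query);
        }
    }
}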

Inverted indexes
dataset:
  0: "hello world"
  1: "small world"
  2: "goodbye world"
index:
  {
    "hello"   => [ 0 ],
    "goodbye" => [ 2 ],
    "small"   => [ 1 ],
    "world"   => [ 0, 1, 2 ]   # posting lists must be sorted
  }
query: (hello OR goodbye) AND world
       ([ 0 ] OR [ 2 ]) AND [ 0, 1, 2 ]  ->  merge  ->  [ 0, 2 ]
Indexes are kept for ufi, country, region, district and more.
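A small sketch of the sorted posting-list merge this slide describes (my own illustration, not the talk's code): union for OR, intersection for AND, both relying on the lists being sorted.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sorted posting-list operations for the example query:
// (hello OR goodbye) AND world  =>  ([0] OR [2]) AND [0,1,2]  =>  [0,2]
public class PostingLists {
    // OR: union of two sorted lists
    static List<Integer> union(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            if (j == b.size() || (i < a.size() && a.get(i) < b.get(j))) out.add(a.get(i++));
            else if (i == a.size() || b.get(j) < a.get(i)) out.add(b.get(j++));
            else { out.add(a.get(i)); i++; j++; }   // equal: take once, advance both
        }
        return out;
    }

    // AND: intersection of two sorted lists
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = Map.of(
                "hello", List.of(0), "goodbye", List.of(2),
                "small", List.of(1), "world", List.of(0, 1, 2));
        List<Integer> result = intersect(union(index.get("hello"), index.get("goodbye")),
                                         index.get("world"));
        System.out.println(result); // [0, 2]
    }
}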

Application server / database
filter:
- based on search criteria (stars, Wi-Fi, parking, etc.)
- based on group matching (# of rooms and persons per room)
- based on availability (check-in and check-out dates)
sort:
- by price, distance, review score, etc.
top N merge
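As a sketch of the "top N" step (my own illustration; the ScoredHotel type and price-based ordering are assumptions): keep only the N best candidates while scanning, using a bounded heap, then merge.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Keep the N cheapest hotels seen so far using a max-heap of size N.
public class TopN {
    record ScoredHotel(int hotelId, double price) {}

    static List<ScoredHotel> cheapestN(Iterable<ScoredHotel> candidates, int n) {
        // Max-heap on price: the root is the most expensive of the current top N.
        PriorityQueue<ScoredHotel> heap =
                new PriorityQueue<>(Comparator.comparingDouble(ScoredHotel::price).reversed());
        for (ScoredHotel h : candidates) {
            heap.offer(h);
            if (heap.size() > n) heap.poll();   // drop the most expensive
        }
        List<ScoredHotel> top = new ArrayList<>(heap);
        top.sort(Comparator.comparingDouble(ScoredHotel::price));
        return top;
    }

    public static void main(String[] args) {
        List<ScoredHotel> hotels = List.of(
                new ScoredHotel(1, 120.0), new ScoredHotel(2, 80.0),
                new ScoredHotel(3, 200.0), new ScoredHotel(4, 95.0));
        System.out.println(cheapestN(hotels, 2)); // hotels 2 and 4
    }
}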

Application server / database
- data statically partitioned (modulo partitioning by hotel id)
- hotel data: kept in RAM, not persisted (easy enough to fetch and rebuild), updated hourly
- availability data: persisted, real-time updates
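Static modulo partitioning is just this (an illustrative sketch; the partition count and routing code are assumptions, not Booking.com's implementation):

// Route a hotel to a partition by hotel id modulo the number of partitions.
public class Partitioner {
    private final int numPartitions;

    Partitioner(int numPartitions) { this.numPartitions = numPartitions; }

    int partitionFor(long hotelId) {
        return (int) (hotelId % numPartitions);   // same hotel always lands on the same partition
    }

    public static void main(String[] args) {
        Partitioner p = new Partitioner(8);
        System.out.println(p.partitionFor(1234567L)); // -> partition 7
    }
}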

RocksDB: an embedded key-value storage, an LSM (log-structured merge-tree) database

Why RocksDB?
- needed an embedded key-value storage
- tried MapDB, Kyoto/Tokyo Cabinet, LevelDB
- reasons for the choice:
  - stable random-read performance under random writes and compaction (80% reads, 20% writes)
  - works on HDDs with ~1.5K updates per second
  - dataset fits in RAM (in-memory workload)

RocksDB use and configuration
- RocksDB v3.13.1, JNI + a custom patch
- the config is the result of an iterative try-and-fail approach, optimized for read latency
- mmap reads
- compression at the application level
- WriteBatchWithIndex for read-your-own-writes
- multiple smaller DBs instead of one big one: simplifies purging old availability
config:
  .setDisableDataSync(false)
  .setWriteBufferSize(15 * SizeUnit.MB)
  .setMaxOpenFiles(-1)
  .setLevelCompactionDynamicLevelBytes(true)
  .setMaxBytesForLevelBase(160 * SizeUnit.MB)
  .setMaxBytesForLevelMultiplier(10)
  .setTargetFileSizeBase(15 * SizeUnit.MB)
  .setAllowMmapReads(true)
  .setMemTableConfig(new HashSkipListMemTableConfig())
  .setMaxBackgroundCompactions(1)
  .useFixedLengthPrefixExtractor(8)
  .setTableFormatConfig(new PlainTableConfig()
      .setKeySize(8)
      .setStoreIndexInFile(true)
      .setIndexSparseness(8))
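For context, a minimal sketch of how these options could be applied when opening one of the smaller per-shard DBs through RocksJava, written against the 3.x API the slide mentions (setDisableDataSync was removed in later RocksDB versions); the path, key layout, create-if-missing flag and error handling are my assumptions, not the production code:

import java.nio.ByteBuffer;
import org.rocksdb.*;
import org.rocksdb.util.SizeUnit;

// Open one availability shard with the options listed on the slide.
public class OpenAvailabilityDb {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        Options options = new Options()
                .setCreateIfMissing(true)
                .setDisableDataSync(false)
                .setWriteBufferSize(15 * SizeUnit.MB)
                .setMaxOpenFiles(-1)
                .setLevelCompactionDynamicLevelBytes(true)
                .setMaxBytesForLevelBase(160 * SizeUnit.MB)
                .setMaxBytesForLevelMultiplier(10)
                .setTargetFileSizeBase(15 * SizeUnit.MB)
                .setAllowMmapReads(true)              // required for PlainTable
                .setMemTableConfig(new HashSkipListMemTableConfig())
                .setMaxBackgroundCompactions(1)
                .useFixedLengthPrefixExtractor(8)
                .setTableFormatConfig(new PlainTableConfig()
                        .setKeySize(8)
                        .setStoreIndexInFile(true)
                        .setIndexSparseness(8));

        RocksDB db = RocksDB.open(options, "/tmp/availability-shard-0");
        try {
            // 8-byte keys (hotel id as a long) to match the fixed-length prefix/key size
            byte[] key = ByteBuffer.allocate(8).putLong(1234567L).array();
            db.put(key, "serialized availability blob".getBytes());
            System.out.println(new String(db.get(key)));
        } finally {
            db.close();
        }
    }
}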

[Diagram: the materialization pipeline]

Materialized availability queue
- no replication between nodes: simplifies the architecture
- calculate once: simplifies app logic, no need to re-implement the logic

Node consistency
- eventually consistent: naturally fits the business
- rely on monitoring/alerting
- quality checks: an observer compares results
- easy and fast to rebuild a node

Results

Results
MR search vs. MR search + local AV + new tech stack:
- Adriatic coast (~30K hotels): before 13 s, after 30 ms
- Rome (~6K hotels): before 5 s, after 20 ms
- Sofia (~0.3K hotels): before 200 ms, after 10 ms

Conclusion

Conclusion
1. search on top of a normalized dataset in MySQL
2. search on top of a pre-computed (flattened) dataset in MySQL
3. MR-search on top of a pre-computed dataset in MySQL
4. MR-search on top of a local dataset in RocksDB (authoritative dataset stays in MySQL)
A full rewrite, but conceptually a small step.
Locality matters. The technology stack (the constant factor) matters.

Thank you! ivan.kruglov@booking.com