Crescando: Predictable Performance for Unpredictable Workloads

G. Alonso, D. Fauser, G. Giannikis, D. Kossmann, J. Meyer, P. Unterbrunner
Amadeus S.A.; ETH Zurich, Systems Group
(Funded by Enterprise Computing Center)

Overview
- Background & Problem Statement
- Approach
- Experiments & Results

Amadeus Workload
Passenger-Booking Database
- ~600 GB of raw data (two years of bookings)
- single table, denormalized
- ~50 attributes: flight-no, name, date, ..., many flags
Query Workload
- up to 4000 queries / second
- latency guarantee: 2 seconds
- today: only pre-canned queries allowed
Update Workload
- avg. 600 updates per second (1 update per GB per sec)
- peak of 12000 updates per second
- data freshness guarantee: 2 seconds
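The workload figures above can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers from the slide; the variable names are illustrative and not part of the original material.

```python
# Numbers taken from the slide; variable names are illustrative.
raw_data_gb = 600             # ~600 GB of raw data
avg_updates_per_sec = 600     # average update rate
peak_updates_per_sec = 12000  # peak update rate

# Average update density: updates per GB of data per second.
updates_per_gb_per_sec = avg_updates_per_sec / raw_data_gb

# How far the peak update rate exceeds the average.
peak_to_avg_ratio = peak_updates_per_sec / avg_updates_per_sec

print(updates_per_gb_per_sec)  # 1.0
print(peak_to_avg_ratio)       # 20.0
```

The update peak is 20 times the average, so a system sized for the average rate alone would miss the 2-second freshness guarantee at peak.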

Amadeus Query Examples
Simple Queries
- Print passenger list of Flight LH 4711
- Give me LH HON Circle from Frankfurt to Delhi
Complex Queries
- Give me all Heathrow passengers that need special assistance (e.g., after terror warning)
Problems with State of the Art
- Simple queries work only because of materialized views
- multi-month project to implement new query / process
- Complex queries do not work at all

Why are traditional DBMS a pain?
[Figure: two plots of MySQL query latency in msec (50th, 90th, and 99th percentile), one against update load in updates/sec, one against the synthetic workload parameter s.]
- Performance depends on workload parameters
- changes in update rate, queries, ... -> huge variance
- impossible / expensive to predict and tune correctly

Goals
Predictable (= constant) Performance
- independent of updates, query types, ...
Meet SLAs
- latency, data freshness
Affordable Cost
- ~1000 COTS machines are okay (compare to mainframe)
Meet Consistency Requirements
- monotonic reads (ACID not needed)
Respect Hardware Trends
- main-memory, NUMA, large data centers

Selected Related Work
- L. Qiao et al. Main-memory scan sharing for multi-core CPUs. VLDB '08.
  Cooperative main-memory scans for ad-hoc OLAP queries (read-only)
- P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyperpipelining query execution. CIDR '05.
  Cooperative scans over vertical partitions on disk
- K. A. Ross. Selection conditions in main memory. ACM TODS, 29(1), 2004.
- S. Chandrasekaran and M. J. Franklin. Streaming queries over streaming data. VLDB '02.
  Query-data join
- G. Candea, N. Polyzotis, R. Vingralek. A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses. VLDB '09.
  An always-on join operator based on similar requirements and design principles

Overview
- Background & Problem Statement
- Approach
- Experiments & Results

What is Crescando?
A distributed (relational) table: main memory (MM) on NUMA
- horizontally partitioned
- distributed within and across machines
Query / update interface
- SELECT * FROM table WHERE <any predicate>
- UPDATE table SET <anything> WHERE <any predicate>
- monotonic reads / writes (snapshot isolation (SI) within a single partition)
Some nice properties
- constant / predictable latency & data freshness
- solves the Amadeus use case

Design
Operate MM like disk in a shared-nothing architecture
- Core ~ Spindle (many cores per machine & data center)
- all data kept in main memory (log to disk for recovery)
- each core scans one partition of data all the time
Batch queries and updates: shared scans
- do trivial multi-query optimization (MQO) (at scan level on system with single table)
- control read/update pattern -> no data contention
Index queries, not data
- just as in the stream processing world
- predictable + optimizable: rebuild indexes every second
Updates are processed before reads
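The idea of a shared scan that indexes the queries rather than the data can be illustrated with a small sketch. It assumes, purely for illustration, that every query is a single equality predicate on one attribute; the function name `shared_scan` and the sample records are hypothetical and do not come from the Crescando code base.

```python
from collections import defaultdict

def shared_scan(records, queries):
    """Answer a whole batch of queries in one pass over `records`.

    `queries` maps a query id to an (attribute, value) equality predicate.
    This is a hypothetical simplification: real predicates are arbitrary.
    """
    # Index the queries, not the data: attribute -> value -> [query ids].
    query_index = defaultdict(lambda: defaultdict(list))
    for qid, (attr, value) in queries.items():
        query_index[attr][value].append(qid)

    results = defaultdict(list)
    for record in records:  # single scan of the partition
        for attr, by_value in query_index.items():
            # Probe the query index with the record's attribute value.
            for qid in by_value.get(record.get(attr), ()):
                results[qid].append(record)
    return dict(results)

# Hypothetical sample data.
records = [
    {"flight": "LH4711", "name": "Meier", "assist": True},
    {"flight": "LH4711", "name": "Rossi", "assist": False},
    {"flight": "BA0117", "name": "Smith", "assist": True},
]
queries = {
    "q1": ("flight", "LH4711"),
    "q2": ("assist", True),
}
matches = shared_scan(records, queries)
```

The cost of the pass is driven by the number of records and indexed attributes rather than by the number of queries in the batch, which is what makes the latency insensitive to query volume in this simplified model.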

Crescando in Data Center (N Machines)

Crescando on 1 Machine (N Cores)
[Figure: Input Queue (Operations) -> Split -> N Scan Threads, each with its own Input Queue (Operations) and Output Queue (Result Tuples) -> Merge -> Output Queue (Result Tuples)]
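The split and merge stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the function names are hypothetical, round-robin partitioning is borrowed from the scale-up experiment later in the deck, and Python threads stand in for scan threads pinned to cores (CPython threads do not give real parallel scans). The sketch assumes at least one partition.

```python
from concurrent.futures import ThreadPoolExecutor

def round_robin_partition(records, n):
    """Assign record i to partition i mod n."""
    return [records[i::n] for i in range(n)]

def scan_partition(partition, predicate):
    """Stand-in for one scan thread: filter its own partition only."""
    return [r for r in partition if predicate(r)]

def split_and_merge(partitions, predicate):
    """Send the same operation to every partition, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # Split: every scan thread receives the same operation.
        partial = pool.map(scan_partition, partitions,
                           [predicate] * len(partitions))
        # Merge: concatenate the per-partition result tuples.
        merged = []
        for part in partial:
            merged.extend(part)
    return merged
```

Because the partitions are disjoint, the merge step is a plain concatenation and no scan thread ever touches another thread's data.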

Crescando on 1 Core
[Figure labels: Active Queries (Predicate Indexes, Unindexed Queries); Queries + Upd.; data partition (records, Record 0); Read Cursor; Write Cursor; Snapshot n; Snapshot n+1; results as {record, {query-ids}}]

Scanning a Partition
[Figure, shown in three steps: a read cursor and a write cursor move over the partition starting at Record 0, with Snapshot n and Snapshot n+1 marked. Step 2: merge cursors. Step 3: build indexes for next batch of queries and updates.]
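A scan with merged cursors can be sketched in a few lines. The sketch assumes that, for each record, the updates of the current batch are applied in arrival order before the queries of the same batch are evaluated, so every query sees the record as of the next snapshot. The function name and the data layout (a list of dicts) are hypothetical.

```python
def scan_with_merged_cursors(partition, updates, queries):
    """One pass over `partition` (a list of dicts), mutated in place.

    updates: list of (predicate, changes) pairs in arrival order.
    queries: dict of query id -> predicate.
    Returns query id -> list of copies of the matching records.
    """
    results = {qid: [] for qid in queries}
    for record in partition:
        # Write side: apply every matching update, in arrival order.
        for predicate, changes in updates:
            if predicate(record):
                record.update(changes)
        # Read side: queries see the record after all updates of the batch.
        for qid, predicate in queries.items():
            if predicate(record):
                results[qid].append(dict(record))
    return results
```

Each record is read and written once per pass by a single thread, which is why this access pattern needs no locks within a partition.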

Crescando @ Amadeus
[Figure labels: Queries (Oper. BI); Transactions (OLTP); Aggregators; Mainframe; Update stream (queue); Query / {Key}; Key / Value; Store (e.g., S3); Crescando Nodes]

Implementation Details
Optimization
- decide for batch of queries which indexes to build
- runs once every second (must be fast)
Query + update indexes
- different indexes for different kinds of predicates
- e.g., hash tables, R-trees, tries, ...
- must fit in L2 cache (better L1 cache)
Probe indexes
- Updates in right order, queries in any order
Persistence & Recovery
- Log updates / inserts to disk (not a bottleneck)
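The per-batch optimization step can be pictured with a simple greedy heuristic. The slides do not describe the actual optimizer, so the sketch is an assumption throughout: it models each query as a conjunction of equality predicates, builds hash indexes on the most frequently referenced attributes, and leaves the remaining queries unindexed. The name `plan_batch` and the `max_indexes` parameter are hypothetical.

```python
from collections import Counter, defaultdict

def plan_batch(queries, max_indexes=2):
    """Pick which query indexes to build for one batch.

    queries: dict of query id -> dict of attribute -> required value.
    Returns (indexes, unindexed): indexes maps attribute -> value -> [query ids];
    unindexed lists the queries that must be checked against every record.
    """
    # Count how many queries reference each attribute.
    counts = Counter(attr for preds in queries.values() for attr in preds)
    chosen = [attr for attr, _ in counts.most_common(max_indexes)]

    indexes = {attr: defaultdict(list) for attr in chosen}
    unindexed = []
    for qid, preds in queries.items():
        # File the query under the first chosen attribute it constrains.
        attr = next((a for a in chosen if a in preds), None)
        if attr is None:
            unindexed.append(qid)
        else:
            indexes[attr][preds[attr]].append(qid)
    return indexes, unindexed
```

A query filed under one attribute still has its remaining predicates checked against each candidate record; the index only narrows down which queries to look at. Capping the number of indexes is one way to keep the structures small enough for the cache sizes mentioned above.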

Crescando in the Cloud
[Figure: traditional stack. Client <-> Web Server (HTTP; XML, JSON, HTML) <-> App Server (FCGI, ...; XML, JSON, HTML) <-> DB Server (SQL; records) <-> Store (get/put block)]

Crescando in the Cloud
[Figure: the traditional stack (Clients, Web Server, App Server, DB Server, Store; HTTP, FCGI, SQL, get/put block) next to the Crescando variant. Labels of the Crescando variant: Workload Splitter; Web/App Aggregators (XML, JSON, HTML); queries/updates <-> records; Crescando Nodes; Store (e.g., S3)]

Overview
- Background & Problem Statement
- Approach
- Experiments & Results

Benchmark Environment
Crescando Implementation
- Shared library for POSIX systems
- Heavily optimized C++ with some inline assembly
Benchmark Machines
- 16-core Opteron machine with 32 GB DDR2 RAM
- 64-bit Linux SMP kernel, ver. 2.6.27, NUMA enabled
Benchmark Database
- The Amadeus Ticket view (one record per passenger per flight)
- ~350 bytes per record; 47 attributes, many of them flags
- Benchmarks use 15 GB of net data
Query + Update Workload
- Current: Amadeus Workload (from Amadeus traces)
- Predicted: Synthetic workload with varying predicate selectivity
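The size of the benchmark database follows from the figures above. The sketch assumes decimal gigabytes (10^9 bytes) and, for the per-core figure, one equally sized partition on each of the 16 cores; both are assumptions, since the slides vary the number of threads between experiments.

```python
# Figures from the slide.
net_data_bytes = 15 * 10**9  # 15 GB of net data (decimal GB assumed)
record_bytes = 350           # ~350 bytes per record
cores = 16                   # 16-core Opteron machine

# Approximate number of records in the benchmark table.
records_total = net_data_bytes // record_bytes

# Assumption: one equally sized partition per core.
partition_gb = 15 / cores
records_per_partition = records_total // cores

print(records_total)          # 42857142
print(partition_gb)           # 0.9375
print(records_per_partition)  # 2678571
```

Under these assumptions each core scans a little under 1 GB, or roughly 2.7 million records, per pass.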

Multi-core Scale-up
[Figure: throughput plot with annotated values 558.5 Q/s, 10.5 Q/s, and 1.9 Q/s]
Round-robin partitioning, read-only Amadeus workload, vary number of threads

Latency vs. Query Volume
[Figure annotations: base latency of scan; L1 cache; L2 cache; thrashing, queue overflows]
Hash partitioning, read-only Amadeus workload, vary queries/sec

Latency vs. Concurrent Writes
Hash partitioning, Amadeus workload, 2000 queries/sec, vary updates

Crescando vs. MySQL - Latency
[Figure: two plots. Amadeus workload, 100 q/sec, vary updates; synthetic read-only workload, vary skew.]
- updates + big queries cause massive queuing
- s = 1.4: 1 / 3,000 queries do not hit an index
- s = 1.5: 1 / 10,000 queries do not hit an index
- 16 s = time for full-table scan in MySQL

Crescando vs. MySQL - Throughput
[Figure: two plots. Amadeus workload, vary updates; synthetic read-only workload, vary skew (annotated "read-only workload!").]

Equivalent Annual Cost (2009)
[Figure: EAC/GB of Crescando storage (y-axis, 0 to 1,000) against years of ownership (x-axis, 0 to 5) for five configurations: 8 x Opteron 8439 SE, 4 x Opteron 8439 SE, 4 x Opteron 8393 SE, 4 x Xeon X7460, 4 x Xeon E7450.]
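The slide does not state how the equivalent annual cost was computed. For orientation, the sketch below uses the textbook definition: the purchase price divided by the annuity factor for the ownership period, plus a yearly operating cost. The function name, parameters, and any numbers passed to it are hypothetical and not taken from the slides.

```python
def equivalent_annual_cost(purchase_price, annual_operating_cost,
                           years, discount_rate):
    """Textbook EAC: purchase price spread over `years`, plus yearly costs.

    The annuity factor is (1 - (1 + r)^-t) / r, or simply t when r is 0.
    """
    if discount_rate == 0:
        annuity_factor = years
    else:
        annuity_factor = (1 - (1 + discount_rate) ** -years) / discount_rate
    return purchase_price / annuity_factor + annual_operating_cost
```

Dividing the result by the amount of data a machine can hold gives a cost per GB. Under this definition the cost per year falls as the ownership period grows, because the purchase price is spread over more years.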

Summary of Experiments
- high concurrent query + update throughput
  - Amadeus: ~4000 queries/sec + ~1000 updates/sec
  - updates do not impact latency of queries
- predictable and guaranteed latency
  - depends on size of partition: not optimal, good enough
- cost and energy efficiency
  - depends on workload: great for hot data, heavy workloads
- consistency: write monotonicity, can build SI on top
- works great on NUMA!
  - controls read+write pattern
  - linear scale-up with number of cores

Status & Outlook
Status
- Fully operational system
- Extensive experiments at Amadeus
- Production: Summer 2011 (planned)
Outlook
- Column store variant of Crescando
- Compression
- E-cast: flexible partitioning & replication
- Joins over normalized data, Aggregation, ...

Conclusion
A new way to process queries
- Massively parallel, simple, predictable
- Not always optimal, but always good enough
Ideal for operational BI
- High query throughput
- Concurrent updates with freshness guarantees
Great building block for many scenarios
- Rethink database and storage system architecture