LazyBase: Trading freshness and performance in a scalable database

Similar documents
Trading latency for freshness in storage systems

VOLTDB + HP VERTICA. page

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Shen PingCAP 2017

R-Store: A Scalable Distributed System for Supporting Real-time Analytics

Webinar Series TMIP VISION

Challenges for Data Driven Systems

Was ist dran an einer spezialisierten Data Warehousing platform?

Data-Intensive Distributed Computing

A Fast and High Throughput SQL Query System for Big Data

Oracle Exadata: Strategy and Roadmap

Evolution of Database Systems

Safe Harbor Statement

Embedded Technosolutions

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

Advanced Database Technologies NoSQL: Not only SQL

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

In-Memory Data Management Jens Krueger

Big Data Analytics. Rasoul Karimi

Real-time Calculating Over Self-Health Data Using Storm Jiangyong Cai1, a, Zhengping Jin2, b

HYRISE In-Memory Storage Engine

SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less

Big Data Technology Incremental Processing using Distributed Transactions

BIG DATA COURSE CONTENT

YCSB++ Benchmarking Tool Performance Debugging Advanced Features of Scalable Table Stores

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Part 1: Indexes for Big Data

Main-Memory Databases 1 / 25

New Approaches to Big Data Processing and Analytics

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

microsoft

Flash Storage Complementing a Data Lake for Real-Time Insight

Evolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo

S-Store: Streaming Meets Transaction Processing

CISC 7610 Lecture 2b The beginnings of NoSQL

Stages of Data Processing

DATA MINING TRANSACTION

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

IoT Data Storage: Relational & Non-Relational Database Management Systems Performance Comparison

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

EXTRACT DATA IN LARGE DATABASE WITH HADOOP

Trading Freshness for Performance in Distributed Systems James Cipar CMU-CS

DATABASE DESIGN II - 1DL400

relational Key-value Graph Object Document

Automating Information Lifecycle Management with

The Future of High Performance Computing

Huge market -- essentially all high performance databases work this way

Time Series Live 2017

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Introduction to Big-Data

Teradata Analyst Pack More Power to Analyze and Tune Your Data Warehouse for Optimal Performance

In-Memory Data Management

8/24/2017 Week 1-B Instructor: Sangmi Lee Pallickara

Intelligent Caching in Data Virtualization Recommended Use of Caching Controls in the Denodo Platform

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

New Oracle NoSQL Database APIs that Speed Insertion and Retrieval

Azure Cosmos DB. Planet Earth Scale, for now. Mike Sr. Consultant, Microsoft

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Performance and Scalability Overview

Flexible Network Analytics in the Cloud. Jon Dugan & Peter Murphy ESnet Software Engineering Group October 18, 2017 TechEx 2017, San Francisco

CS 61C: Great Ideas in Computer Architecture. MapReduce

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Big Data with Hadoop Ecosystem

Play2SDG: Bridging the Gap between Serving and Analytics in Scalable Web Applications

Evolving To The Big Data Warehouse

In-Memory Technology in Life Sciences

Data Analytics at Logitech Snowflake + Tableau = #Winning

Massively Parallel Processing. Big Data Really Fast. A Proven In-Memory Analytical Processing Platform for Big Data

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Microsoft Exam

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

An Introduction to Big Data Formats

CSE 344 Final Review. August 16 th

Advanced Databases: Parallel Databases A.Poulovassilis

The Google File System

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Next-Generation Cloud Platform

Unit 10 Databases. Computer Concepts Unit Contents. 10 Operational and Analytical Databases. 10 Section A: Database Basics

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

Data Storage Infrastructure at Facebook

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

CS 655 Advanced Topics in Distributed Systems

Data Access 3. Managing Apache Hive. Date of Publish:

Lecture 11 Hadoop & Spark

Distributed computing: index building and use

SQL, NoSQL, MongoDB. CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

YCSB++ benchmarking tool Performance debugging advanced features of scalable table stores

CSE 124: Networked Services Lecture-16

Increase Value from Big Data with Real-Time Data Integration and Streaming Analytics

Heckaton. SQL Server's Memory Optimized OLTP Engine

Windows Azure Services - At Different Levels

Big Data Analytics CSCI 4030

Transcription:

LazyBase: Trading freshness and performance in a scalable database (EuroSys 2012) Jim Cipar, Greg Ganger, *Kimberly Keeton, *Craig A. N. Soules, *Brad Morrey, *Alistair Veitch PARALLEL DATA LABORATORY Carnegie Mellon University * HP Labs

(very) High-level overview LazyBase is Distributed data base High-throughput, rapidly changing data sets Efficient queries over consistent snapshots Tradeoff between query latency and freshness http://www.pdl.cmu.edu/ 2

Query freshness There is delay when loading data into database Data may not be visible to read() immediately In some cases delay could be minutes or hours Freshness is an indication of this delay These results contain all data as of 15 seconds ago http://www.pdl.cmu.edu/ 3

Query latency Time between issuing a query, and receiving results Can range from milliseconds to hours depending on complexity of query and amount of data involved LazyBase allows programmers to choose a tradeoff between query freshness and latency http://www.pdl.cmu.edu/ 4

Example application http://www.pdl.cmu.edu/ 5

Example application High bandwidth stream of Tweets Many thousands per second 200 million per day http://www.pdl.cmu.edu/ 6

Example application High bandwidth stream of Tweets Many thousands per second 200 million per day Queries accept different freshness levels Freshest: USGS Twitter Earthquake Detector Fresh: Hot news in last 10 minutes Stale: social network graph analysis Consistency is important Tweets can refer to earlier tweets Some apps look for cause-effect relationships http://www.pdl.cmu.edu/ 7

Class of analytical applications Performance Continuous high-throughput updates Big analytical queries (scan large parts of data set) Freshness Varying freshness requirements for queries Freshness varies by query, not data set Consistency There are many types, more on this later... http://www.pdl.cmu.edu/ 8

Applications and freshness Freshness / Domain Seconds Minutes Hours+ Retail Enterprise information management Real-time coupons, targeted ads Infected machine identification Transportation Emergency response Just-in-time inventory File-based policy validation Real-time traffic maps Product search, earnings reports E-discovery requests, search Traffic engineering, route planning http://www.pdl.cmu.edu/ 9

Current solutions Online transaction processing databases (OLTP) Data warehouse systems NoSQL databases http://www.pdl.cmu.edu/ 10

Online transaction processing Typical databases supporting SQL language Focus on consistency and reliability Typically used for short transactions Highly selective query with index or small inserts Support continuous ingest and querying Performance often not sufficient for data mining [Chaudhuri 97], [Plattner 09] http://www.pdl.cmu.edu/ 11

Data warehouse systems Sometimes called Online Analytical Processing Used for analyzing large data sets Efficient scans and aggregations Designed for data mining: fast efficient queries Data loaded by batch jobs: Extract-transform-load pipeline (ETL) At any given time data may be many hours old http://www.pdl.cmu.edu/ 12

NoSQL databases Many diverse architectures Cassandra, HBase, Hypertable, MongoDB Designed for scalability on large clusters Scalable ingest and query Sacrifice consistency for scalability http://www.pdl.cmu.edu/ 13

Current Solutions Performance Freshness Consistency OLTP/ SQL DBs Data warehouse NoSQL http://www.pdl.cmu.edu/ 14

Current Solutions Performance Freshness Consistency OLTP Data warehouse LazyBase is designed to support all three NoSQL http://www.pdl.cmu.edu/ 15

Key ideas Separate concepts of consistency and freshness Batching to improve throughput, provide consistency Trade latency for freshness Can choose on a per-query basis Fast queries over stale data Slower queries over fresh data http://www.pdl.cmu.edu/ 16

Consistency freshness Separate concepts of consistency and freshness Query results may be stale: missing recent writes Query results must be consistent http://www.pdl.cmu.edu/ 17

Query (or read) transactions A query transaction is a group of queries Each query asks for some subset of the data All rows where the user name is jcipar Possibly aggregated in different ways The average size of tweets from user jcipar http://www.pdl.cmu.edu/ 18

Write transactions Group of inserts, updates and deletes Insert add a new row with these values Update modify row with id X Delete remove row with id X http://www.pdl.cmu.edu/ 19

Snapshot isolation Commonly used in OLTP databases Simple case: read-only/write-only transactions Write transactions applied atomically to most recent version of database Read transactions act on single version in database Every query in the transaction accesses same snapshot http://www.pdl.cmu.edu/ 20

Consistency in LazyBase Atomic multi-row updates Monotonicity: If a query sees update A, all subsequent queries will see update A Consistent prefix: Total ordering of updates If a query sees update number X, it will also see updates 1 (X-1) http://www.pdl.cmu.edu/ 21

LazyBase limitations Only supports observational data Transactions are read-only or write-only No read-modify-write Not online transaction processing Not (currently) targeting really huge scale 10s of servers, not 1000s Not everyone is a Google (or a Facebook ) http://www.pdl.cmu.edu/ 22

LazyBase design LazyBase is a distributed database Commodity servers (e.g. 8 CPU cores, 16 GB RAM) Can use direct attached storage Each server runs: General purpose worker process Ingest server that accepts client requests Query agent that processes query requests Logically LazyBase is a pipeline Each pipeline stage can be run on any worker Single stage may be parallelized on multiple workers http://www.pdl.cmu.edu/ 23

Pipelined data flow Client Ingest http://www.pdl.cmu.edu/ 24

Pipelined data flow Old authority Client Ingest Sort Merge Authority table http://www.pdl.cmu.edu/ 25

Batching updates Batching for performance and atomicity Common technique for throughput E.g. bulk loading of data in data warehouse ETL Also provides basis for atomic multi-row operations Batches are applied atomically and in-order Called SCU (self-consistent update file) SCUs contain inserts, updates, deletes http://www.pdl.cmu.edu/ 26

Batching for performance Large batches of updates increase throughput http://www.pdl.cmu.edu/ 27

Problem with batching: latency Batching trades update latency for throughput Large batches database is very stale Very large batches/busy system could be hours old OK for some queries, bad for others http://www.pdl.cmu.edu/ 28

Put the lazy in LazyBase As updates are processed through pipeline, they become progressively easier to query. We can use this to trade query latency for freshness. http://www.pdl.cmu.edu/ 29

Query freshness Old authority Client Ingest Sort Merge Authority table http://www.pdl.cmu.edu/ 30

Query freshness Old authority Client Ingest Sort Merge Raw Input Sorted Authority table Slow, fresh Fast, stale http://www.pdl.cmu.edu/ 31

Query freshness Quake! Hot news! Graph analysis Raw Input Sorted Authority table Slow, fresh Fast, stale http://www.pdl.cmu.edu/ 32

Query freshness Quake! Hot news! Graph analysis Raw Input Sorted Authority table Slow, fresh Fast, stale http://www.pdl.cmu.edu/ 33

Query freshness Quake! Hot news! Graph analysis Raw Input Sorted Authority table Slow, fresh Fast, stale http://www.pdl.cmu.edu/ 34

Query freshness Quake! Hot news! Graph analysis Raw Input Sorted Authority table Slow, fresh Fast, stale http://www.pdl.cmu.edu/ 35

Query interface User issues high-level queries Programatically or like a limited subset of SQL Specifies freshness SELECT COUNT(*) FROM tweets WHERE user = jcipar FRESHNESS 30; Client library handles all the dirty work http://www.pdl.cmu.edu/ 36

Query latency/freshness Queries allowing staler results return faster http://www.pdl.cmu.edu/ 37

Experiments to show Importance of batching Freshness/performance tradeoff http://www.pdl.cmu.edu/ 38

Experiments to show Importance of batching Freshness/performance tradeoff Throughput and scalability of updates Performance for queries Both small and big queries Consistency relative to Cassandra Freshness relative to Cassandra http://www.pdl.cmu.edu/ 39

Experimental setup Ran on OpenCirrus cluster in DCO 6 dedicated cores, 12GB RAM, per server Local storage Data set was ~38 million tweets 50 GB uncompressed Compared to Cassandra Reputation as write-optimized NoSQL database http://www.pdl.cmu.edu/ 40

Experiments to show Importance of batching Freshness/performance tradeoff Throughput and scalability of ingest Performance for queries Both small and big queries Consistency relative to Cassandra Freshness relative to Cassandra http://www.pdl.cmu.edu/ 41

Ingest scalability experiment Measured time to ingest entire data set Uploaded in parallel from 20 servers Varied number of worker processes http://www.pdl.cmu.edu/ 42

Ingest scalability results LazyBase scales effectively up to 20 servers Efficiency is ~4x better than Cassandra http://www.pdl.cmu.edu/ 43

Experiments to show Importance of batching Freshness/performance tradeoff Throughput and scalability of ingest Performance for queries Both small and big queries Consistency relative to Cassandra Freshness relative to Cassandra http://www.pdl.cmu.edu/ 44

Query experiments Test performance of fastest queries Access only authority table Two types of queries: point and range Point queries get single tweet by ID Range queries get list of valid tweet IDs in range Range size chosen to return ~0.1% of all IDs Cassandra range queries used get_slice Actual range queries discouraged http://www.pdl.cmu.edu/ 45

Point query throughput Queries scale to multiple clients Raw performance suffers due to on-disk format http://www.pdl.cmu.edu/ 46

Range query throughput Range query performance ~4x Cassandra http://www.pdl.cmu.edu/ 47

Experiments to show Importance of batching Freshness/performance tradeoff Throughput and scalability of ingest Performance for queries Both small and big queries Consistency relative to Cassandra Freshness relative to Cassandra http://www.pdl.cmu.edu/ 48

Consistency experiments Goal: test atomicity of updates Table with 2 rows, A and B Each row stores an integer Write transactions simultaneously increment A and decrement B A+B should always be 0 http://www.pdl.cmu.edu/ 49

Sum = A+B LazyBase maintains inter-row consistency http://www.pdl.cmu.edu/ 50

Experiments to show Importance of batching Freshness/performance tradeoff Throughput and scalability of ingest Performance for queries Both small and big queries Consistency relative to Cassandra Freshness relative to Cassandra http://www.pdl.cmu.edu/ 51

Freshness experiment Goal: test freshness in consistency experiment Same 2 rows, A and B, but add timestamps Timestamp shows age of data in database A and B should have the same timestamp http://www.pdl.cmu.edu/ 52

Freshness LazyBase results may be stale, timestamps are nondecreasing http://www.pdl.cmu.edu/ 53

Summary Provide performance, consistency and freshness Batching improves update throughput, hurts latency Separate ideas of consistency and freshness Tradeoff between freshness and latency Use pipelined database to meet these needs Allow queries to access intermediate pipeline data http://www.pdl.cmu.edu/ 54

Future work Soft causality constraints LazyBase assumes updates have total ordering In reality it is often useful to reorder updates Shortest Job First policy for improved update latency Process high priority updates quickly Many systems use causal consistency http://www.pdl.cmu.edu/ 55

Causal consistency An update can be caused by previous updates A tweet may depend on a user s previous tweets A Tweet can refer to a previous tweet explicitly Allow updates to be reordered Must respect causalality If A caused B, then update A must happen before B http://www.pdl.cmu.edu/ 56

Causal consistency is hard Need to track causal relationships: Must maintain information about each update Can become a performance problem Often infeasible May be (practically) impossible Tweets refer to other tweets It is difficult to determine which update a Tweet was in http://www.pdl.cmu.edu/ 57

Estimating causality Causality may not be that important Some messages out of order Roll back and re-execute operations Can we estimate causal relationships using already available data? Relative time stamps User ID User popularity http://www.pdl.cmu.edu/ 58

Soft causal consistency Estimating causal relationships means we might occasionally violate them Instead of treating them as hard constraints, consider them another objective Trade off between causality violations and Latency Parallelism Intermediate results http://www.pdl.cmu.edu/ 59

Thanks! http://www.pdl.cmu.edu/ 60