Data Management in the Cloud: Limitations and Opportunities. Daniel Abadi Yale University January 30 th, 2009

Similar documents
HadoopDB: An open source hybrid of MapReduce

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

Actian Vector Benchmarks. Cloud Benchmarking Summary Report

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu University of California, Merced

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

GLADE: A Scalable Framework for Efficient Analytics. Florin Rusu (University of California, Merced) Alin Dobra (University of Florida)

CSE6331: Cloud Computing

Introduction to Data Management CSE 344

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Introduction to Data Management CSE 344

Fault Tolerance in K3. Ben Glickman, Amit Mehta, Josh Wheeler

Data Management in the Cloud. Tim Kraska

Introduction to Computer Science. William Hsu Department of Computer Science and Engineering National Taiwan Ocean University

Cloud Analytics Database Performance Report

How to survive the Data Deluge: Petabyte scale Cloud Computing

Database in the Cloud Benchmark

Introduction to Database Systems CSE 414

A Comparison of Approaches to Large-Scale Data Analysis

Jozsef Patvarczki Comprehensive exam Due August 24 th, Subject: Distributed Database Systems. Q1) Map Reduce and Distributed Databases

Chapter 24 NOSQL Databases and Big Data Storage Systems

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

CompSci 516 Database Systems

Next-Generation Cloud Platform

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia Final Exam. Administrivia Final Exam

Query processing on raw files. Vítor Uwe Reus

OLTP vs. OLAP Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications

Benchmarking Cloud-based Data Management Systems

Shark: SQL and Rich Analytics at Scale. Yash Thakkar ( ) Deeksha Singh ( )

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Five Common Myths About Scaling MySQL

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

Safe Harbor Statement

Shark: SQL and Rich Analytics at Scale. Michael Xueyuan Han Ronny Hajoon Ko

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CISC 7610 Lecture 2b The beginnings of NoSQL

QLIK INTEGRATION WITH AMAZON REDSHIFT

Distributed Data Store

Advanced Peer: Data sharing in network based on Peer to Peer

A Global In-memory Data System for MySQL Daniel Austin, PayPal Technical Staff

Scalability of web applications

Parallel Techniques for Big Data. Patrick Valduriez

It also performs many parallelization operations like, data loading and query processing.

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

CISC 7610 Lecture 5 Distributed multimedia databases. Topics: Scaling up vs out Replication Partitioning CAP Theorem NoSQL NewSQL

Analytics in the Cloud Mandate or Option?

CIB Session 12th NoSQL Databases Structures

Jargons, Concepts, Scope and Systems. Key Value Stores, Document Stores, Extensible Record Stores. Overview of different scalable relational systems

Introduction to NoSQL Databases

Chapter 20: Database System Architectures

Data Warehouse Appliance: Main Memory Data Warehouse

NewSQL Databases. The reference Big Data stack

5/2/16. Announcements. NoSQL Motivation. The New Hipster: NoSQL. Serverless. What is the Problem? Database Systems CSE 414

Database Systems CSE 414

Advances in Data Management - NoSQL, NewSQL and Big Data A.Poulovassilis

Distributed computing: index building and use

Scott Meder Senior Regional Sales Manager

Data Processing in Cloud with AVL Structure and Bootstrap Access Control

Extreme Computing. NoSQL.

SCALABLE CONSISTENCY AND TRANSACTION MODELS

Lecture 23 Database System Architectures

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

A MapReduce Relational-Database Index-Selection Tool

Taming Structured And Unstructured Data With SAP HANA Running On VCE Vblock Systems

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Data Informatics. Seon Ho Kim, Ph.D.

Microsoft Big Data and Hadoop

Introduction to Data Management CSE 344

CSE 344 JULY 9 TH NOSQL

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Apache Ignite and Apache Spark Where Fast Data Meets the IoT

Migrating Oracle Databases To Cassandra

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Parallel Techniques for Big Data

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Introduction To Cloud Computing

Introduction to NoSQL

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Oracle GoldenGate for Big Data

Rule 14 Use Databases Appropriately

5 Fundamental Strategies for Building a Data-centered Data Center

VOLTDB + HP VERTICA. page

CAP Theorem. March 26, Thanks to Arvind K., Dong W., and Mihir N. for slides.

Quobyte The Data Center File System QUOBYTE INC.

Chapter 18: Parallel Databases

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

MULTICORE IN DATA APPLIANCES. Gustavo Alonso Systems Group Dept. of Computer Science ETH Zürich, Switzerland

Scalable Tools - Part I Introduction to Scalable Tools

Massive Scalability With InterSystems IRIS Data Platform

The Hazards of Multi-writing in a Dual-Master Setup

Advanced Database Technologies NoSQL: Not only SQL

Amazon AWS-Solution-Architect-Associate Exam

10/18/2017. Announcements. NoSQL Motivation. NoSQL. Serverless Architecture. What is the Problem? Database Systems CSE 414

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Transcription:

Data Management in the Cloud: Limitations and Opportunities Daniel Abadi Yale University January 30 th, 2009

Want milk with your breakfast? Buy a cow Big upfront cost Produces more (or less) milk than you need Uses up resources Time spent maintaining it Unpleasant waste product Buy bottled milk Continued cost Buy what you need Less resource intensive No maintenance Waste somebody else s problem

Your Computer is a Cow Your computer Big upfront cost Produces more (or less) milk than you need Uses up resources (electricity) Time spent maintaining it Produces unpleasant waste (heat, noise) What if you could get computing power even more conveniently than bottled milk?

Cloud Computing is Bottled Milk Companies willing to rent computing resources from their data centers Resources include storage, processing cycles, software stacks Google,, Microsoft, Amazon, Sun, Hewlett- Packard, Yahoo, EMC, and AT&T all taking part E.g., for $0.10/hour Amazon will give you: 1.7 GB memory Equivalent of 1.2 GHz processor 350GB storage

Cloud Computing Concerns What if my data or service provider becomes unavailable? What if my supplier suddenly increases how much they charge me? What about security? What about lock in?

Cloud Computing Concerns What if my data or service provider becomes unavailable? What if my supplier suddenly increases Remember: how much they bottled charge milk me? is SOOO much cheaper and more convenient! What about security? What about lock in?

Key Cloud Characteristics for DBMS Deployment Compute power is elastic But only if workload is parallelizable Want shared-nothing DBMS Data is stored at an untrusted host Data is replicated, often across large geographic distances Done under the covers E.g., Amazon s regions and availability zones

Xactional DBMS Applications Problems: Xactional DBMSs are typically not shared-nothing It is hard to maintain ACID guarantees in the face of replication across large distances CAP theorem: consistency, availability, tolerance to partitions choose two SimpleDB,, PNUTS relax consistency BigTable,, Microsoft SQL Server Data Services relax atomicity Large risks when storing operational data on an untrusted host

Analytical DBMS Applications Great fit for cloud deployment: Shared-nothing is becoming standard E.g., Teradata, Vertica, DATAllegro, Dataupia, Greenplum,, Aster Data, DB2 DPF, Exadata, Netezza ACID guarantees are not needed Sensitive data can be left out of the analysis $5 billion market (1/3 rd of DBMS market)

Cloud DBMS Wish List Efficiency Pricing model makes this paramount Fault tolerance Failures are common Want no data loss Want no work loss

Cloud DBMS Wish List Ability to run in a heterogeneous environment It is nearly impossible to keep machines all running at the same speed Ability to interface with BI products I.e. SQL, ODBC, JDBC interfaces Scale, scale, scale!

Data Analysis in the Cloud Parallel databases are the obvious choice right? Interface with BI products Compete fiercely on efficiency/performance Scale horizontally

Parallel Database Scalability

Parallel Database Scalability Try scaling them to 1000 nodes Strange network bottlenecks Restarting queries on a failure actually matter Heterogeneous node effects

We want something That can handle enormous scale... Does not restart queries upon a failure Designed for heterogeneous environments MapReduce?

But MapReduce Doesn t t interface with BI applications Is extremely inefficient

Efficiency CREATE TABLE UserVisits ( sourceip VARCHAR(16), desturl VARCHAR(100), visitdate DATE, adrevenue FLOAT, useragent VARCHAR(64), countrycode VARCHAR(3), langcode VARCHAR(6), searchword VARCHAR(32), duration INT); SELECT SUBSTR(sourceIP,, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP,, 1, 7);

Conclusion Data analysis well suited for the cloud No current software meets all elements on wish-list A hybrid between parallel databases and MapReduce is called for (Kamil( Bajda- Pawlikowski and Azza Abouzeid to the rescue)

Come Join the Yale DB Group!