What to do with Scientific Data? Michael Stonebraker

Similar documents
SciDB An Open Source Data Base Project. Paul Brown* *medical emergency trumped presence

SS-DB: A Standard Science DBMS Benchmark

New Direction for TPC. Michael Stonebraker

Column-Stores vs. Row-Stores: How Different Are They Really?

Retrospective Satellite Data in the Cloud: An Array DBMS Approach* Russian Supercomputing Days 2017, September, Moscow

Hybrid Shipping Architectures: A Survey

Emerging Database Technologies and Their Applicability to High Energy Physics: A First Look at SciDB

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

5 Fundamental Strategies for Building a Data-centered Data Center

Rajiv GandhiCollegeof Engineering& Technology, Kirumampakkam.Page 1 of 10

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Microsoft Developing SQL Databases

Microsoft. [MS20762]: Developing SQL Databases

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

CAS CS 460/660 Introduction to Database Systems. Fall

Using PostgreSQL in Tantan - From 0 to 350bn rows in 2 years

2.3 Algorithms Using Map-Reduce

PowerCenter 7 Architecture and Performance Tuning

CSE 544: Principles of Database Systems

Introduction to Data Management CSE 344

It also performs many parallelization operations like, data loading and query processing.

Traditional RDBMS Wisdom is All Wrong -- In Three Acts "

Understanding Geospatial Data Models

Cloud Computing & Visualization

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Introduction to Data Management CSE 344

SQL Server Development 20762: Developing SQL Databases in Microsoft SQL Server Upcoming Dates. Course Description.

WHITEPAPER. MemSQL Enterprise Feature List

Basics of Data Management

DURATION : 03 DAYS. same along with BI tools.

20762B: DEVELOPING SQL DATABASES

Developing SQL Databases

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Introduction to NoSQL

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Database Architectures

Agenda. AWS Database Services Traditional vs AWS Data services model Amazon RDS Redshift DynamoDB ElastiCache

Introduction and Overview

Goals for Today. CS 133: Databases. Relational Model. Multi-Relation Queries. Reason about the conceptual evaluation of an SQL query

Appliances and DW Architecture. John O Brien President and Executive Architect Zukeran Technologies 1

High-Performance Distributed DBMS for Analytics

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Performance Evaluation of Multidimensional Array Storage Techniques in Databases

CIB Session 12th NoSQL Databases Structures

Approaching the Petabyte Analytic Database: What I learned

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Huge market -- essentially all high performance databases work this way

Postgres Plus and JBoss

Lab 9. Julia Janicki. Introduction

Databasesystemer, forår 2005 IT Universitetet i København. Forelæsning 8: Database effektivitet. 31. marts Forelæser: Rasmus Pagh

Cosmic Peta-Scale Data Analysis at IN2P3

"Charting the Course... MOC C: Developing SQL Databases. Course Summary

Using PostgreSQL, Prometheus & Grafana for Storing, Analyzing and Visualizing Metrics

CMPT 354 Database Systems I. Spring 2012 Instructor: Hassan Khosravi

Chapter 5. Indexing for DWH

DATABASE PERFORMANCE AND INDEXES. CS121: Relational Databases Fall 2017 Lecture 11

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Basant Group of Institution

SciSpark 201. Searching for MCCs

Advanced Data Management Technologies Written Exam

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Mahathma Gandhi University

Oracle Autonomous Database

Introduction to Data Management CSE 344

An exploration of SciDB in the context of emerging technologies for data stores in particle physics and cosmology

Dealing with Data Especially Big Data

Cosmic Peta-Scale Data Analysis at IN2P3

Data Management in the Cloud: Limitations and Opportunities. Daniel Abadi Yale University January 30 th, 2009

MongoDB Schema Design

Introduction to Database Systems CSE 414

SQL Simple Queries. Chapter 3.1 V3.01. Napier University

Apache Ignite and Apache Spark Where Fast Data Meets the IoT

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Advanced Database Systems

SQL Server SQL Server 2008 and 2008 R2. SQL Server SQL Server 2014 Currently supporting all versions July 9, 2019 July 9, 2024

<Insert Picture Here> DBA s New Best Friend: Advanced SQL Tuning Features of Oracle Database 11g

Introduction to Data Management. Lecture #1 (Course Trailer ) Instructor: Chen Li

Scientific Workflows and Cloud Computing. Gideon Juve USC Information Sciences Institute

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

EsgynDB Enterprise 2.0 Platform Reference Architecture

Survey of Oracle Database

Systems Analysis & Design

Kenneth A. Hawick P. D. Coddington H. A. James

Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection

Splunk is a great tool for exploring your log data. It s very powerful, but

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

CSE 344 JULY 9 TH NOSQL

CSE544 Database Architecture

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

Chapter 3: Data Mining:

Spatial Data Management

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Cloud Computing. What is cloud computing. CS 537 Fall 2017

Module 9: Selectivity Estimation

PostgreSQL: Hyperconverged DBMS

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Independent consultant. Oracle ACE Director. Member of OakTable Network. Available for consulting In-house workshops. Performance Troubleshooting

Transcription:

What to do with Scientific Data? by Michael Stonebraker

Outline Science data what it looks like Hardware options for deployment Software options RDBMS Wrappers on RDBMS SciDB

Courtesy of LSST. Used with permission. O(100) petabytes

Courtesy of LSST. Used with permission.

LSST Data Raw imagery 2-D arrays of telescope readings Cooked into observations Image intensity algorithm (data clustering) Spatial data Further cooked into trajectories Similarity query Constrained by maximum distance

Example LSST Queries Recook raw imagery with my algorithm Find all observations in a spatial region Find all trajectories that intersect a cylinder in time

Snow Cover in the Sierras Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.

Satellite Imagery Raw data Array of pixels precessing around the earth Spherical co-ordinates Cooked into images Typically best pixel over a time window i.e. image is a composite of several passes Further cooked into various other things E.g. polygons of constant snow cover

Example Queries Recook raw data Using a different composition algorithm Retrieve cooked imagery in a time cylinder Retrieve imagery which is changing at a large rate

Chemical Plant Data Plant is a directed graph of plumbing Sensors at various places (1/sec observations) Directed graph of time series To optimize output plant runs near the edge And fails every once in a while down for a week

Chemical Plant Data Record all data {(time, sensor-1, sensor-5000)} Look for interesting events i.e. sensor values out of whack Cluster events near each other in 5000 dimension space Idea is to identify near-failure modes

General Model sensors Derived data Cooking Algorithm(s) (pipeline)

Traditional Wisdom Cooking pipeline outside DBMS Derived data loaded into DBMS for subsequent querying

Problems with This Approach Easy to lose track of the raw data Cannot query the raw data Recooking is painful in application logic might be easier in a DBMS (stay tuned) Provenance (meta data about the data) is often not captured E.g. cooking parameters E.g. sensor calibration

My preference Load the raw data into a DBMS Cooking pipeline is a collection of user-defined functions (DBMS extensions) Activated by triggers or a workflow management system ALL data captured in a common system!!!

Deployment Options Supercomputer/mainframe Individual project silos Internal grid (cloud behind the firewall) External cloud (e.g. Amazon EC20

Deployment Options Supercomputer/main frame ($$$$) Individual project silos Probably what you do now. Every silo has a system administrator and a DBA (expensive) Generally results in poor sharing of data

Deployment Options Internal grid (cloud behind the firewall) Mimic what Google/Amazon/Yahoo/et.al do Other report huge savings in DBA/SE costs Does not require you buy VMware Requires a software stack that can enforce service guarantees

Deployment Options External cloud (e.g. EC2) Amazon can stand up a node wildly cheaper than Exxon economies of scale from 10K nodes to 500K nodes Security/company policy issues will be an issue Amazon pricing will be an issue Likely to be the cheapest in the long run

What DBMS to Use? RDBMS (e.g. Oracle) Pretty hopeless on raw data Simulating arrays on top of tables likely to cost a factor of 10-100 Not pretty on time series data Find me a sensor reading whose average value over the last 3 days is within 1% of the average value over the adjoining 5 sensors

What DBMS to Use? RDBMS (e.g. Oracle) Spatial data may (or may not) be ok Cylinder queries will probably not work well 2-D rectangular regions will probably be ok Look carefully at spatial indexing support (usually R-trees)

RDBMS Summary Wrong data model Arrays not tables Wrong operations Regrid not join Missing features Versions, no-overwrite, provenance, support for uncertain data,

But your mileage may vary SQLServer working well for Sloan Skyserver data base See paper in CIDR 2009 by Jose Blakeley

How to Do Analytics (e.g.clustering) Suck out the data Convert to array format Pass to MatLab, R, SAS, Compute Return answer to DBMS

Bad News Painful Slow Many analysis platforms are main memory only

RDBMS Summary Issues not likely to get fixed any time soon Science is small compared to business data processing

Wrapper on Top of RDBMS -- MonetDB Arrays simulated on top of tables Layer above RDBMS will replace SQL with something friendlier to science But will not fix performance problems!! Bandaid solution

RasDaMan Solution An array is a blob or array is cut into chunks and stored as a collection of blobs Array DBMS is in user-code outside DBMS Uses RDBMS as a reliable (but slow) file system Grid support looks especially slow

My Proposal -- SciDB Build a commercial-quality array DBMS from the ground up.

SciDB Data Model Nested multidimensional arrays Augmented with co-ordinate systems (floating point dimensions) Ragged arrays Array values are a tuple of values and arrays

Data Storage Optimized for both dense and sparse array data Different data storage, compression, and access Arrays are chunked (in multiple dimensions) Chunks are partitioned across a collection of nodes Chunks have overlap to support neighborhood operations Replication provides efficiency and back-up Fast access to data sliced along any dimension Without materialized views

SciDB DDL CREATE ARRAY Test_Array < A: integer NULLS, B:double, C:USER_DEFINED_TYPE > [I=0:99999,1000, 10, J=0:99999,1000, 10 ] PARTITION OVER ( Node1, Node2, Node3 ) USIN G block_cyclic(); attribute names A, B, C index names I, J chunk size 1000 overlap 10

Array Query Language (AQL) Array data management (e.g. filter, aggregate, join, etc.) Stat operations (multiply, QR factor, etc.) Parallel, disk-oriented User-defined operators (Postgres-style) Interface to external stat packages (e.g. R)

Array Query Language (AQL) SELECT Geo-Mean ( T.B ) FROM Test_Array T WHERE T.IBETWEEN :C1 AND :C2 AND T.J BET W E E N :C3 A N D :C4 AND T.A = 10 GROUP BY T.I; User-defined aggregate on an attribute B in array T Subsample Filter Group-by So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL

Matrix Multiply CREATE ARRAY TS_Data < A1:int32, B1:double > [ I=0:99999,1000,0, J=0:3999,100,0 ] Select multiply (TS_data.A1, test_array.b) Smaller of the two arrays is replicated at all nodes Scatter-gather Each node does its core of the bigger array with the replicated smaller one Produces a distributed answer

Shared nothing cluster 10 s 1000 s of nodes Commodity hardware TCP/IP between nodes Linear scale-up Architecture Each node has a processor and storage Queries refer to arrays as if not distributed Application Language Specific UI Runtime Supervisor Query Interface and Parser Plan Generator Application Layer Java, C++, whatever Doesn t require JDBC, ODBC Server Layer AQL an extension of SQL Also supports UDFs Query planner optimizes queries for efficient data access & processing Query plan runs on a node s local executor&storage manager Runtime supervisor coordinates execution Node 3 Node 2 Node 1 Local Executor Storage Manager Storage Layer

Other Features Which Science Guys Want (These could be in RDBMS, but Aren t) Uncertainty Data has error bars Which must be carried along in the computation (interval arithmetic)

Other Features Time travel Don t fix errors by overwrite I.e. keep all of the data Named versions Recalibration usually handled this way

Other Features Provenance (lineage) What calibration generated the data What was the cooking algorithm In general repeatability of data derivation

MIT OpenCourseWare http://ocw.mit.edu 6.830 / 6.814 Database Systems Fall 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.