Introduction to Data Management CSE 344

Similar documents
Introduction to Database Systems CSE 414

Introduction to Data Management CSE 344

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

CSE 544: Principles of Database Systems

CSE 344 MAY 2 ND MAP/REDUCE

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

Introduction to Data Management CSE 344

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 11 Parallel DBMSs and MapReduce

Introduction to Data Management CSE 344

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Why compute in parallel?

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/15/18

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/7/17

COURSE 12. Parallel DBMS

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Hadoop vs. Parallel Databases. Juliana Freire!

Introduction to Data Management CSE 344

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Data Management in the Cloud: Limitations and Opportunities. Daniel Abadi Yale University January 30 th, 2009

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

CSE6331: Cloud Computing

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Practice and Applications of Data Management CMPSCI 345. Lecture 18: Big Data, Hadoop, and MapReduce

Huge market -- essentially all high performance databases work this way

Introduction to Data Management CSE 344. Lecture 1: Introduction

Introduction to Data Management (Database Systems) CSE 414

Architecture and Implementation of Database Systems (Winter 2014/15)

Your New App. Motivation. Data Management is Universal. Staff. Introduction to Data Management (Database Systems) CSE 414. Lecture 1: Introduction

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Database Systems CSE 414

CSE 544, Winter 2009, Final Examination 11 March 2009

CSE 344 Final Review. August 16 th

Data-Intensive Distributed Computing

Database Architectures

Lecture 23 Database System Architectures

The Future of High Performance Computing

Parallel DBMS. Chapter 22, Part A

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

SAP HANA Scalability. SAP HANA Development Team

CSE 344 MAY 7 TH EXAM REVIEW

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

CS 6240: Parallel Data Processing in MapReduce: Module 1. Mirek Riedewald

Optimization Overview

BIG DATA TESTING: A UNIFIED VIEW

What is Data Warehouse like

Big Data - Some Words BIG DATA 8/31/2017. Introduction

CS352 Lecture: Database System Architectures last revised 11/22/06

VOLTDB + HP VERTICA. page

Data Intensive Scalable Computing. Thanks to: Randal E. Bryant Carnegie Mellon University

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Crescando: Predictable Performance for Unpredictable Workloads

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads

On-Line Application Processing

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Embedded Technosolutions

Hadoop/MapReduce Computing Paradigm

CMPT 354: Database System I. Lecture 1. Course Introduction

How to survive the Data Deluge: Petabyte scale Cloud Computing

CPS352 Lecture: Database System Architectures last revised 3/27/2017

Parallel DBs. April 25, 2017

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

5 Fundamental Strategies for Building a Data-centered Data Center

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

ELTMaestro for Spark: Data integration on clusters

Chapter 20: Database System Architectures

In-Memory Data Management

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

CSE 344 JANUARY 3 RD - INTRODUCTION

MapReduce. Stony Brook University CSE545, Fall 2016

Technological Challenges in the GAIA Archive

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

CSE 417 Practical Algorithms. (a.k.a. Algorithms & Computational Complexity)

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

Database Systems CSE 414

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Parallel DBs. April 23, 2018

Data Processing at Scale (CSE 511)

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

L22: The Relational Model (continued) CS3200 Database design (sp18 s2) 4/5/2018

Semi-Structured Data Management (CSE 511)

Database Systems CSE 414

Chapter 17: Parallel Databases

Survey of Oracle Database

Principles of Software Construction: Objects, Design, and Concurrency. The Perils of Concurrency Can't live with it. Cant live without it.

Designing a scalable twitter

Architecture-Conscious Database Systems

Safe Harbor Statement

CS-460 Database Management Systems

Transcription:

Introduction to Data Management CSE 344 Lecture 25: Parallel Databases CSE 344 - Winter 2013 1

Announcements Webquiz due tonight last WQ! J HW7 due on Wednesday HW8 will be posted soon Will take more hours than other HWs (complex setup, queries run for many hours) No late days Plan ahead! Hold on to the email that Lee Lee sent you today Next four lectures: parallel databases Traditional, MapReduce+PigLatin CSE 344 - Winter 2013 2

Parallel Computation Today Two Major Forces Pushing towards Parallel Computing: Change in Moore s law Cloud computing CSE 344 - Winter 2013 3

Parallel Computation Today 1. Change in Moore's law* (exponential growth in transistors per chip density) no longer results in increased clock speeds Increased hw performance available only through parallelism Think multicore: 4 cores today, perhaps 64 in a few years * Moore's law says that the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years [Intel co-founder Gordon E. Moore described the trend in his 1965 paper and predicted that it will last for at least 10 years] 4

Parallel Computation Today 2. Cloud computing commoditizes access to large clusters Ten years ago, only Google could afford 1000 servers; Today you can rent this from Amazon Web Services (AWS) 5

Science is Facing a Data Deluge! Astronomy: Large Synoptic Survey Telescope LSST: 30TB/night (high-resolution, high-frequency sky surveys) Physics: Large Hadron Collider 25PB/year Biology: lab automation, high-throughput sequencing Oceanography: high-resolution models, cheap sensors, satellites Medicine: ubiquitous digital records, MRI, ultrasound CSE 344 - Winter 2013 6

Science is Facing a Data Deluge! CSE 344 - Winter 2013 7

Industry is Facing a Data Deluge! Clickstreams, search logs, network logs, social networking data, RFID data, etc. Facebook: 15PB of data in 2010 60TB of new data every day Google: In May 2010 processed 946PB of data using MapReduce Twitter, Google, Microsoft, Amazon, Walmart, etc. CSE 344 - Winter 2013 8

Industry is Facing a Data Deluge! CSE 344 - Winter 2013 9

Big Data Companies, organizations, scientists have data that is too big, too fast, and too complex to be managed without changing tools and processes Relational algebra and SQL are easy to parallelize and parallel DBMSs have already been studied in the 80's! CSE 344 - Winter 2013 10

Data Analytics Companies As a result, we are seeing an explosion of and a huge success of db analytics companies Greenplum founded in 2003 acquired by EMC in 2010; A parallel shared-nothing DBMS (this lecture) Vertica founded in 2005 and acquired by HP in 2011; A parallel, column-store shared-nothing DBMS (see 444 for discussion of column-stores) DATAllegro founded in 2003 acquired by Microsoft in 2008; A parallel, shared-nothing DBMS Aster Data Systems founded in 2005 acquired by Teradata in 2011; A parallel, shared-nothing, MapReduce-based data processing system (next lecture). SQL on top of MapReduce Netezza founded in 2000 and acquired by IBM in 2010. A parallel, shared-nothing DBMS. Great time to be in the data management, data mining/statistics, or machine learning!

Two Approaches to Parallel Data Processing Parallel databases, developed starting with the 80s (this lecture) OLTP (Online Transaction Processing) OLAP (Online Analytic Processing, or Decision Support) MapReduce, first developed by Google, published in 2004 (next lecture) Only for Decision Support Queries Today we see convergence of the two approaches (Greenplum,Dremmel) CSE 344 - Winter 2013 12

Parallel DBMSs Goal Improve performance by executing multiple operations in parallel Key benefit Cheaper to scale than relying on a single increasingly more powerful processor Key challenge Ensure overhead and contention do not kill performance CSE 344 - Winter 2013 13

Performance Metrics for Parallel DBMSs P = the number of nodes (processors, computers) Speedup: More nodes, same data è higher speed Scaleup: More nodes, more data è same speed OLTP: Speed = transactions per second (TPS) Decision Support: Speed = query time CSE 344 - Winter 2013 14

Linear v.s. Non-linear Speedup Speedup 1 5 10 15 # nodes (=P) CSE 344 - Winter 2013 15

Linear v.s. Non-linear Scaleup Batch Scaleup Ideal 1 5 10 15 # nodes (=P) AND data size CSE 344 - Winter 2013 16

Challenges to Linear Speedup and Scaleup Startup cost Cost of starting an operation on many nodes Interference Contention for resources between nodes Skew Slowest node becomes the bottleneck CSE 344 - Winter 2013 17

Architectures for Parallel Shared memory Databases Shared disk Shared nothing CSE 344 - Winter 2013 18

Shared Memory P P P Interconnection Network Global Shared Memory D D D CSE544 - Spring, 2012 19

Shared Disk M M M P P P Interconnection Network D D D CSE544 - Spring, 2012 20

Shared Nothing Interconnection Network P P P M M M D D D CSE544 - Spring, 2012 21

A Professional Picture From: Greenplum Database Whitepaper SAN = Storage Area Network CSE 344 - Winter 2013 22

Shared Memory Nodes share both RAM and disk Dozens to hundreds of processors Example: SQL Server runs on a single machine and can leverage many threads to get a query to run faster (see query plans) Easy to use and program But very expensive to scale: last remaining cash cows in the hardware industry CSE 344 - Winter 2013 23

Shared Disk All nodes access the same disks Found in the largest "single-box" (noncluster) multiprocessors Oracle dominates this class of systems. Characteristics: Also hard to scale past a certain point: existing deployments typically have fewer than 10 machines CSE 344 - Winter 2013 24

Shared Nothing Cluster of machines on high-speed network Called "clusters" or "blade servers Each machine has its own memory and disk: lowest contention. NOTE: Because all machines today have many cores and many disks, then shared-nothing systems typically run many "nodes on a single physical machine. Characteristics: Today, this is the most scalable architecture. Most difficult to administer and tune. We discuss only Shared Nothing in class 25

In Class You have a parallel machine. Now what? How do you speed up your DBMS? CSE 344 - Winter 2013 26

Approaches to Parallel Query Evaluation cid=cid Inter-query parallelism Transaction per node OLTP cid=cid cid=cid pid=pid pid=pid Customer pid=pid Customer Customer Product Purchase Product Purchase Product Purchase CSE 344 - Winter 2013

Approaches to Parallel Query Evaluation cid=cid Inter-query parallelism Transaction per node OLTP cid=cid cid=cid pid=pid pid=pid Customer pid=pid Customer Customer Product Purchase Product Purchase Product Purchase Inter-operator parallelism Operator per node Both OLTP and Decision Support cid=cid pid=pid Customer Product Purchase CSE 344 - Winter 2013

Approaches to Parallel Query Evaluation Inter-query parallelism Transaction per node OLTP Product pid=pid cid=cid Purchase pid=pid cid=cid Customer Product Product Purchase pid=pid cid=cid Customer Purchase Customer Inter-operator parallelism Operator per node Both OLTP and Decision Support Intra-operator parallelism Operator on multiple nodes Decision Support Product cid=cid pid=pid Customer Purchase cid=cid pid=pid Customer Product Purchase CSE 344 - Winter 2013

Approaches to Parallel Query Evaluation Inter-query parallelism Transaction per node OLTP Product pid=pid cid=cid Purchase pid=pid cid=cid Customer Product Product Purchase pid=pid cid=cid Customer Purchase Customer Inter-operator parallelism Operator per node Both OLTP and Decision Support Intra-operator parallelism Operator on multiple nodes Decision Support Product cid=cid pid=pid Customer Purchase cid=cid pid=pid Customer Product Purchase We study only intra-operator CSE 344 - Winter parallelism: 2013 most scalable

Review in Class Basic query processing on one node. Given relations R(A,B) and S(B, C), compute: Selection: σ A=123 (R) Group-by: γ A,sum(B) (R) Join: R S CSE 344 - Winter 2013 31

Horizontal Data Partitioning Have a large table R(K, A, B, C) Need to partition on a shared-nothing architecture into P chunks R 1,, R P, stored at the P nodes Block Partition: size(r 1 ) size(r P ) Hash partitioned on attribute A: Tuple t goes to chunk i, where i = h(t.a) mod P + 1 Range partitioned on attribute A: Partition the range of A into - = v 0 < v 1 < < v P = Tuple t goes to chunk i, if v i-1 < t.a < v i CSE 344 - Winter 2013 32

Parallel GroupBy R(K,A,B,C), discuss in class how to compute these GroupBy s, for each of the partitions γ A,sum(C) (R) γ B,sum(C) (R) CSE 344 - Winter 2013 33

Parallel GroupBy γ A,sum(C) (R) If R is partitioned on A, then each node computes the group-by locally Otherwise, hash-partition R(K,A,B,C) on A, then compute group-by locally: R 1 R 2... R P Reshuffle R on attribute A R 1 R 2... R P CSE 344 - Winter 2013 34

Speedup and Scaleup The runtime is dominated by the time to read the chunks from disk, i.e. size(r i ) If we double the number of nodes P, what is the new running time of γ A,sum(C) (R)? If we double both P and the size of the relation R, what is the new running time? CSE 344 - Winter 2013 35

Uniform Data v.s. Skewed Data Uniform partition: size(r 1 ) size(r P ) size(r) / P Linear speedup, constant scaleup Skewed partition: For some i, size(r i ) size(r) / P Speedup and scaleup will suffer CSE 344 - Winter 2013 36

Uniform Data v.s. Skewed Data Let R(K,A,B,C); which of the following partition methods may result in skewed partitions? Block partition Hash-partition On the key K On the attribute A Range-partition On the key K On the attribute A CSE 344 - Winter 2013 37

Uniform Data v.s. Skewed Data Let R(K,A,B,C); which of the following partition methods may result in skewed partitions? Block partition Hash-partition On the key K On the attribute A Range-partition On the key K On the attribute A Uniform Uniform May be skewed May be skewed Assuming uniform hash function E.g. when all records have the same value of the attribute A, then all records end up in the same partition Difficult to partition the range of A uniformly. CSE 344 - Winter 2013 38

Parallel Join In class: compute R(A,B) S(B,C) R 1, S 1 R 2, S 2... R P, S P CSE 344 - Winter 2013 39

Parallel Join In class: compute R(A,B) S(B,C) R 1, S 1 R 2, S 2... R P, S P Reshuffle R on R.B and S on S.B R 1, S 1 R 2, S 2... R P, S P Each server computes the join locally CSE 344 - Winter 2013 40

Amazon Warning We HIGHLY recommend you remind students to turn off any instances after each class/session as this can quickly diminish the credits and start charging the card on file. You are responsible for the overages. AWS customers can now use billing alerts to help monitor the charges on their AWS bill. You can get started today by visiting your Account Activity page to enable monitoring of your charges. Then, you can set up a billing alert by simply specifying a bill threshold and an e-mail address to be notified as soon as your estimated charges reach the threshold. CSE 344 - Winter 2013 41