Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Similar documents
Introduction to Data Management CSE 344

Introduction to Database Systems CSE 414

Introduction to Data Management CSE 344

CSE 344 MAY 2 ND MAP/REDUCE

Introduction to Database Systems CSE 414. Lecture 16: Query Evaluation

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344

CSE 544: Principles of Database Systems

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 11 Parallel DBMSs and MapReduce

Why compute in parallel?

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Database Management Systems CSEP 544. Lecture 6: Query Execution and Optimization Parallel Data processing

Introduction to Data Management CSE 344

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/15/18

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/7/17

Huge market -- essentially all high performance databases work this way

COURSE 12. Parallel DBMS

Hadoop vs. Parallel Databases. Juliana Freire!

Parallel Nested Loops

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcements. Two typical kinds of queries. Choosing Index is Not Enough. Cost Parameters. Cost of Reading Data From Disk

CSE 344 Final Review. August 16 th

CSE 544, Winter 2009, Final Examination 11 March 2009

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

CSE 344 MAY 7 TH EXAM REVIEW

CSC 261/461 Database Systems Lecture 19

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CSE 344 Final Examination

Morsel- Drive Parallelism: A NUMA- Aware Query Evaluation Framework for the Many- Core Age. Presented by Dennis Grishin

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Database Systems CSE 414

CSE 190D Spring 2017 Final Exam Answers

CSE 414: Section 7 Parallel Databases. November 8th, 2018

Chapter 17: Parallel Databases

Lecture 23 Database System Architectures

Chapter 18 Strategies for Query Processing. We focus this discussion w.r.t RDBMS, however, they are applicable to OODBS.

What is Data Warehouse like

Database Applications (15-415)

Data Modeling and Databases Ch 9: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Evaluation of relational operations

Database Systems CSE 414

Architecture and Implementation of Database Systems (Winter 2014/15)

CSE 190D Spring 2017 Final Exam

CS-460 Database Management Systems

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

MapReduce. Stony Brook University CSE545, Fall 2016

Database System Architectures Parallel DBs, MapReduce, ColumnStores

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Accelerating Analytical Workloads

Announcements. From SQL to RA. Query Evaluation Steps. An Equivalent Expression

Database Systems CSE 414

Chapter 20: Database System Architectures

Parallel DBs. April 23, 2018

Parallel DBMS. Chapter 22, Part A

CompSci 516: Database Systems

University of Waterloo Midterm Examination Sample Solution

OS and Hardware Tuning

Crescando: Predictable Performance for Unpredictable Workloads

Parallel DBs. April 25, 2017

The Evolution of Big Data Platforms and Data Science

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

OS and HW Tuning Considerations!

Data Partitioning and MapReduce

Advanced Databases: Parallel Databases A.Poulovassilis

CSE 344 FEBRUARY 14 TH INDEXING

2.3 Algorithms Using Map-Reduce

Database Architectures

CS352 Lecture: Database System Architectures last revised 11/22/06

Optimization Overview

data parallelism Chris Olston Yahoo! Research

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 8 - Data Warehousing and Column Stores

EECS 647: Introduction to Database Systems

WHITEPAPER. MemSQL Enterprise Feature List

CSE 344 Final Examination

CompSci 516 Data Intensive Computing Systems

Implementation of Relational Operations

Greenplum Architecture Class Outline

Database Applications (15-415)

Data Storage. Query Performance. Index. Data File Types. Introduction to Data Management CSE 414. Introduction to Database Systems CSE 414

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Implementation of Relational Operations: Other Operations

Scalable Tools - Part I Introduction to Scalable Tools

HyPer-sonic Combined Transaction AND Query Processing

Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Administrivia. Administrivia. Faloutsos/Pavlo CMU /615

CSE 344 JULY 30 TH DB DESIGN (CH 4)

CSE 344 APRIL 20 TH RDBMS INTERNALS

Transcription:

Announcements Database Systems CSE 414 HW4 is due tomorrow 11pm Lectures 18: Parallel Databases (Ch. 20.1) 1 2 Why compute in parallel? Multi-cores: Most processors have multiple cores This trend will increase in the future Big data: too large to fit in main memory Distributed query processing on 100-1000 servers Widely available now using cloud services Big Data Companies, organizations, scientists have data that is too big (and sometimes too complex) to be managed without changing tools and processes Complex data processing: Decision support queries (SQL w/ aggregates) Machine learning (adds linear algebra and iteration) 3 4 Two Kinds of Parallel Data Processing Parallel databases, developed starting with the 80s (this lecture) OLTP (Online Transaction Processing) OLAP(Online Analytic Processing, or Decision Support) General purpose distributed processing: MapReduce, Spark Mostly for Decision Support Queries Performance Metrics for Parallel DBMSs P = the number of nodes (processors, computers) Speedup: More nodes, same data higher speed Scaleup: More nodes, more data same speed OLTP: Speed = transactions per second (TPS) Decision Support: Speed = query time 5 6 1

Linear vs. Non-linear Speedup Linear vs. Non-linear Scaleup Speedup Batch Scaleup Ideal 1 5 10 15 1 5 10 15 # nodes (=P) 7 # nodes (=P) AND data size 8 Challenges to Linear Speedup and Scaleup Startup cost Cost of starting an operation on many nodes Interference Contention for resources between nodes Stragglers Slowest node becomes the bottleneck Architectures for Parallel Databases Shared memory Shared disk Shared nothing 9 10 Shared Memory P P P Interconnection Network Global Shared Memory Shared Disk M M M P P P Interconnection Network D D D D D D 11 12 2

Shared Nothing A Professional Picture Interconnection Network P P P M M M D D D From: Greenplum (now EMC) Database Whitepaper SAN = Storage Area Network 13 14 Shared Memory Shared Disk Nodes share both RAM and disk Dozens to hundreds of processors Example: SQL Server runs on a single machine and can leverage many threads to get a query to run faster (see query plans) Easier to program and easy to use But very expensive to scale: last remaining cash cows in the hardware industry All nodes access the same disks Found in the largest "single-box" (noncluster) multiprocessors Oracle dominates this class of systems. Characteristics: Also hard to scale past a certain point: existing deployments typically have fewer than 10 machines 15 16 Shared Nothing Cluster of machines on high-speed network Each machine has its own memory and disk: lowest contention Approaches to Parallel Query Evaluation Inter-query parallelism Transaction per node OLTP Product Product Purchase Product Purchase Purchase NOTE: Because all machines today have many cores and many disks, then shared-nothing systems typically run many "nodes on a single physical machine. Inter-operator parallelism Operator per node Both OLTP and Decision Support Characteristics: Today, this is the most scalable architecture. Most difficult to administer and tune. We discuss only Shared Nothing in class 17 Intra-operator parallelism Operator on multiple nodes Decision Support We study only intra-operator parallelism: most scalable Product Purchase Product Purchase 18 3

Single Node Query Processing (Review) Given relations R(A, B) and S(B, C), no indexes: Selection: σ A=123 (R) Scan file R, select records with A=123 Group-by: γ A, sum(b) (R) Scan file R, insert into a table using attr. A as key When a new key is equal to an existing one, add B to the value Distributed Query Processing Data is horizontally partitioned across many servers Operators may require data reshuffling not all the needed data is in one place Join: R S Scan file S, insert into a table using attr. B as key Scan file R, probe the table using attr. B 19 20 Horizontal Data Partitioning Horizontal Data Partitioning Data: Servers: Data: Servers: 1 2... P 1 2... P Which tuples go to what server? 21 22 Horizontal Data Partitioning Block Partition: Partition tuples arbitrarily s.t. size(r 1 ) size(r P ) Hash partitioned on attribute A: Tuple t goes to chunk i, where i = h(t.a) mod P + 1 Range partitioned on attribute A: Partition the range of A into - = v 0 < v 1 < < v P = Tuple t goes to chunk i, if v i-1 < t.a < v i Data: R(K, A, B, C) Query: γ A, sum(c) (R) Parallel GroupBy How can we compute in each case? R is -partitioned on A R is block-partitioned R is -partitioned on K easy case! 23 24 4

Parallel GroupBy Data: R(K, A, B, C) Query: γ A, sum(c) (R) R is block-partitioned or -partitioned on K R 1 R 2... R P Parallel Join Data: R(K1, A, B), S(K2, B, C) Query: R(K1, A, B) S(K2, B, C) Reshuffle R on R.B and S on S.B Initially, both R and S are horizontally partitioned on K1 and K2 R 1, S 1 R 2, S 2... R P, S P Reshuffle R on attribute A R 1 R 2 R P... Each server computes the join locally R 1, S 1 R 2, S 2... R P, S P 25 26 Data: R(K1, A, B), S(K2, B, C) Query: R(K1, A, B) S(K2, B, C) R1 S1 R2 S2 K1 B K2 B K1 B K2 B Partition 1 20 101 50 3 20 201 20 2 50 102 50 4 20 202 50 Shuffle Local Join M1 R1 S1 R2 S2 K1 B K2 B K1 B K2 B 1 20 201 20 2 50 101 50 3 20 102 50 4 20 M1 M2 202 50 M2 Speedup and Scaleup Consider: Query: γ A, sum(c) (R) Runtime: dominated by reading chunks from disk If we double the number of nodes P, what is the new running time? Half (each server holds ½ as many chunks) If we double both P and the size of R, what is the new running time? Same (each server holds the same # of chunks) 27 28 Uniform Data vs. Skewed Data Let R(K, A, B, C); which of the following partition methods may result in skewed partitions? Block partition Uniform Loading Data into a Parallel DBMS Example using Teradata System Need to figure out where it belongs Hash-partition On the key K On the attribute A Uniform Assuming good function May be skewed E.g. when all records have the same value of the attribute A, then all records end up in the same partition AMP = Access Module Processor = unit of parallelism 29 30 5

Example Parallel Query Execution Example Parallel Query Execution o.pid = p.pid date = today() Find all orders from today, along with the items ordered SELECT * FROM, Product p WHERE o.pid = p.pid AND o.date = today() Product p o.pid = p.pid date = today() Order o h(o.pid) select date=today() h(o.pid) select date=today() h(o.pid) select date=today() 31 32 Example Parallel Query Execution Item i o.pid = p.pid date = today() Example Parallel Query Execution o.pid = p.pid o.pid = p.pid o.pid = p.pid h(p.pid) Product p h(p.pid) Product p h(p.pid) Product p contains all orders and all lines where (pid) = 3 contains all orders and all lines where (pid) = 2 contains all orders and all lines where (pid) = 1 33 34 6