Parallel DBMS. Chapter 22, Part A

Similar documents
CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/15/18

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/7/17

COURSE 12. Parallel DBMS

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Course Content. Parallel & Distributed Databases. Objectives of Lecture 12 Parallel and Distributed Databases

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations

Evaluation of relational operations

Evaluation of Relational Operations: Other Techniques

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Advanced Databases: Parallel Databases A.Poulovassilis

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 12: Query Processing. Chapter 12: Query Processing

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Evaluation of Relational Operations. Relational Operations

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Chapter 17: Parallel Databases

Chapter 12: Query Processing

Architecture and Implementation of Database Systems (Winter 2014/15)

Module 10: Parallel Query Processing

Chapter 13: Query Processing

Implementation of Relational Operations

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

CS-460 Database Management Systems

Parallel DBs. April 23, 2018

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

Chapter 13: Query Processing Basic Steps in Query Processing

Implementation of Relational Operations: Other Operations

Evaluation of Relational Operations: Other Techniques

Evaluation of Relational Operations

Query Evaluation Overview, cont.

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Advanced Database Systems

Huge market -- essentially all high performance databases work this way

Midterm Review CS634. Slides based on Database Management Systems 3 rd ed, Ramakrishnan and Gehrke

Evaluation of Relational Operations

Bridging the Processor/Memory Performance Gap in Database Applications

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Query Evaluation Overview, cont.

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Database System Concepts

Parallel DBs. April 25, 2017

Parallel Query Optimisation

Database Applications (15-415)

What happens. 376a. Database Design. Execution strategy. Query conversion. Next. Two types of techniques

EXTERNAL SORTING. Sorting

Question 1. Part (a) [2 marks] what are different types of data independence? explain each in one line. Part (b) CSC 443 Term Test Solutions Fall 2017

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Database Management System

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Query Processing & Optimization

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CS 525 Advanced Database Organization - Spring 2017 Mon + Wed 1:50-3:05 PM, Room: Stuart Building 111

Weaving Relations for Cache Performance

External Sorting. Chapter 13. Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1

CSE 544: Principles of Database Systems

CAS CS 460/660 Introduction to Database Systems. Query Evaluation II 1.1

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Database Architectures

Implementation of Relational Operations. Introduction. CS 186, Fall 2002, Lecture 19 R&G - Chapter 12

CompSci 516 Data Intensive Computing Systems

data parallelism Chris Olston Yahoo! Research

What We Have Already Learned. DBMS Deployment: Local. Where We Are Headed Next. DBMS Deployment: 3 Tiers. DBMS Deployment: Client/Server

CSC 261/461 Database Systems Lecture 19

PS2 out today. Lab 2 out today. Lab 1 due today - how was it?

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Module 9: Selectivity Estimation

Review. Support for data retrieval at the physical level:

Introduction to Data Management CSE 344

DBMS Query evaluation

Hash table example. B+ Tree Index by Example Recall binary trees from CSE 143! Clustered vs Unclustered. Example

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

CSE 344 MAY 7 TH EXAM REVIEW

Database Applications (15-415)

QUERY EXECUTION: How to Implement Relational Operations?

(Storage System) Access Methods Buffer Manager

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

Introduction to Database Systems CSE 414

Advances in Data Management Query Processing and Query Optimisation A.Poulovassilis

A high performance database kernel for query-intensive applications. Peter Boncz

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

Introduction to Database Systems CSE 414. Lecture 26: More Indexes and Operator Costs

Principles of Data Management. Lecture #9 (Query Processing Overview)

Hadoop vs. Parallel Databases. Juliana Freire!

University of Waterloo Midterm Examination Sample Solution

Sandor Heman, Niels Nes, Peter Boncz. Dynamic Bandwidth Sharing. Cooperative Scans: Marcin Zukowski. CWI, Amsterdam VLDB 2007.

Crescando: Predictable Performance for Unpredictable Workloads

ECS 165B: Database System Implementa6on Lecture 7

Implementing Relational Operators: Selection, Projection, Join. Database Management Systems, R. Ramakrishnan and J. Gehrke 1

Weaving Relations for Cache Performance

Database Architectures

Transcription:

Parallel DBMS Chapter 22, Part A Slides by Joe Hellerstein, UCB, with some material from Jim Gray, Microsoft Research. See also: http://www.research.microsoft.com/research/barc/gray/pdb95.ppt Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Why Parallel Access To Data? At 10 MB/s 1.2 days to scan 1,000 x parallel 1.5 minute to scan. Bandwidth 1 Terabyte 1 Terabyte 10 MB/s Parallelism: divide a big problem into many smaller ones to be solved in parallel. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Parallel DBMS: Intro Parallelism is natural to DBMS processing Pipeline parallelism: many machines each doing one step in a multi-step process. Partition parallelism: many machines doing the same thing to different pieces of data. Both are natural in DBMS! Pipeline Any Sequential Program Any Sequential Program Partition Sequential Sequential Any Any Sequential Sequential Program Program outputs split N ways, inputs merge M ways Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

DBMS: The Success Story DBMSs are the most (only?) successful application of parallelism. Teradata, Tandem vs. Thinking Machines, KSR.. Every major DBMS vendor has some server Workstation manufacturers now depend on DB server sales. Reasons for success: Bulk-processing (= partition -ism). Natural pipelining. Inexpensive hardware can do the trick! Users/app-programmers don t need to think in Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Some Terminology Speed-Up More resources means proportionally less time for given amount of data. Xact/sec. (throughput) Ideal degree of -ism Scale-Up If resources increased in proportion to increase in data size, time is constant. sec./xact (response time) Ideal degree of -ism Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Architecture Issue: Shared What? Shared Memory (SMP) CLIENTS Processors Memory Shared Disk CLIENTS Shared Nothing (network) CLIENTS Easy to program Expensive to build Difficult to scaleup Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke Hard to program Cheap to build Easy to scaleup Sequent, SGI, Sun VMScluster, Sysplex Tandem, Teradata, SP2

What Systems Work This Way Shared Nothing Teradata: 400 nodes Tandem: 110 nodes IBM / SP2 / DB2: 128 nodes Informix/SP2 48 nodes ATT & Sybase? nodes Shared Disk Oracle 170 nodes DEC Rdb 24 nodes CLI ENTS CLI ENTS Shared Memory Informix 9 nodes RedBrick? nodes CLI ENTS Processors Memory Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Different Types of DBMS -ism Intra-operator parallelism get all machines working to compute a given operation (scan, sort, join) Inter-operator parallelism each operator may run concurrently on a different site (exploits pipelining) Inter-query parallelism different queries run on different sites We ll focus on intra-operator -ism Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Automatic Data Partitioning Partitioning a table: Range Hash Round Robin A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z Good for equijoins, range queries group-by Good for equijoins Good to spread load Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke

Parallel Scans Scan in parallel, and merge. Selection may not require all sites for range or hash partitioning. Indexes can be built at each partition. Question: How do indexes differ in the different schemes? Think about both lookups and inserts! What about unique indexes? Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Parallel Sorting Current records: 8.5 Gb/minute, shared-nothing; Datamation benchmark in 2.41 secs (UCB students! http://now.cs.berkeley.edu/nowsort/) Idea: Scan in parallel, and range-partition as you go. As tuples come in, begin local sorting on each Resulting data is sorted, and range-partitioned. Problem: skew! Solution: sample the data at start to determine partition points. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Parallel Aggregates For each aggregate function, need a decomposition: count(s) = SUM count(s(i)), ditto for sum() avg(s) = (SUM sum(s(i))) / SUM count(s(i)) and so on... For groups: Sub-aggregate groups close to the source. Pass each sub-aggregate to its group s site. Chosen via a hash fn. Count Count Count Count Count Count A...E F...J K...N O...S T...Z Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1 Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey A Table

Parallel Joins Nested loop: Each outer tuple must be compared with each inner tuple that might join. Easy for range partitioning on join cols, hard otherwise! Sort-Merge (or plain Merge-Join): Sorting gives range-partitioning. But what about handling 2 skews? Merging partitioned tables is local. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Parallel Hash Join Phase 1 Original Relations (R then S)... INPUT hash function h OUTPUT 1 2 B-1 Partitions 1 2 B-1 Disk B main memory buffers Disk In frst phase, partitions get distributed to different sites: A good hash function automatically distributes work evenly! Do second phase at each site. Almost always the winner for equi-join. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Datafow Network for Join Good use of split/merge makes it easier to build parallel versions of sequential join code. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Complex Parallel Query Plans Complex Queries: Inter-Operator parallelism Pipelining between operators: note that sort and phase 1 of hash-join block the pipeline!! Bushy Trees Sites 1-8 Sites 1-4 Sites 5-8 A B R S Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

N M-way Parallelism Merge Mer ge Merge Sort Sort Sort Sort Sort Join Join Join Join Join A... E F...J K...N O...S T...Z N inputs, M outputs, no bottlenecks. Partitioned Data Partitioned and Pipelined Data Flows Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Observations It is relatively easy to build a fast parallel query executor S.M.O.P. It is hard to write a robust and world-class parallel query optimizer. There are many tricks. One quickly hits the complexity barrier. Still open research! Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

Parallel Query Optimization Common approach: 2 phases Pick best sequential plan (System R algorithm) Pick degree of parallelism based on current system parameters. Bind operators to processors Take query tree, decorate as in previous picture. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 1

What s Wrong With That? Best serial plan!= Best plan! Why? Trivial counter-example: Table partitioned with local secondary index at two nodes Range query: all of node 1 and 1% of node 2. Node 1 should do a scan of its partition. Node 2 should use secondary index. SELECT * FROM telephone_book WHERE name < NoGood ; Table Scan A..M Index Scan N..Z Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2

Parallel DBMS Summary -ism natural to query processing: Both pipeline and partition -ism! Shared-Nothing vs. Shared-Mem Shared-disk too, but less standard Shared-mem easy, costly. Doesn t scaleup. Shared-nothing cheap, scales well, harder to implement. Intra-op, Inter-op, & Inter-query -ism all possible. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2

DBMS Summary, cont. Data layout choices important! Most DB operations can be done partition- Sort. Sort-merge join, hash-join. Complex plans. Allow for pipeline- ism, but sorts, hashes block the pipeline. Partition -ism acheived via bushy trees. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2

DBMS Summary, cont. Hardest part of the equation: optimization. 2-phase optimization simplest, but can be ineffective. More complex schemes still at the research stage. We haven t said anything about Xacts, logging. Easy in shared-memory architecture. Takes some care in shared-nothing. Database Management Systems, 2 nd Edition. Raghu Ramakrishnan and Johannes Gehrke 2