Module 10: Parallel Query Processing

Module Outline

10.1 Objectives in Parallelizing DBMSs
10.2 Speed-Up and Scale-Up
10.3 Opportunities for Parallelization in RDBMSs
10.4 Examples for parallel query execution plans

10.1 Objectives in Parallelizing DBMSs

Thus far, we have (implicitly or explicitly) been considering a DBMS in a client-server architecture: one server operates on the data stored on its local disks, on behalf of any of the numerous clients issuing requests to the server over a (local or global) network. All the data-intensive work, as well as, e.g., transaction management, is done on the (single) server.

The following considerations may lead us to parallel or distributed architectures:

- High performance. A single server implemented on a sequential single-processor machine may not be able to provide the necessary performance (response time, throughput).
- High availability. A single server represents a single point of failure. Hardware or software problems, as well as network disconnection, bring all database operations down.
- Extensibility. Accommodating increasing demands in terms of database size and/or performance will definitely hit hard limits in a single-server architecture.

Architectures for Parallel DBMSs

Typical parallel architectures include:

- Shared Memory
- Shared Disk
- Shared Nothing

[Figure: the shared-memory, shared-disk, and shared-nothing architectures; labels: Local Memory, Global Shared Memory.]

Each of these has its own advantages and potential problems. For example, shared memory is easiest to program, while shared nothing scales best.

10.2 Speed-Up and Scale-Up

Speed-Up. Given a constant problem size (e.g., database size and transaction load), how does the performance (e.g., response time) improve with increased hardware resources (e.g., number of processors and/or disks)?

Scale-Up. When the problem size increases, can we achieve the same performance with hardware resources increased correspondingly?

Overview of metrics for parallel processing:

- Resources variable, problem size constant: Speed-Up
- Resources variable, problem size variable: Scale-Up
- Resources constant: Size-Up, Utilization

10.2.1 Problems with speed-up

Considering the response-time performance indicator, speed-up (w.r.t. an increased number of processors used) is defined as

$$\text{rt-speed-up}(n) = \frac{\text{response time with 1 processor}}{\text{response time with } n \text{ processors}}$$

Similarly, we can use the throughput (number of transactions per second) indicator and define

$$\text{tput-speed-up}(n) = \frac{\text{throughput with } n \text{ processors}}{\text{throughput with 1 processor}}$$

Problem: we cannot achieve linear speed-up; rather...

[Figure: speed-up versus number of processors; curves labeled ideal, optimal, real.]

Amdahl's Law:

$$\text{Speed-Up} \le \frac{1}{\text{seq. part} + \text{par. part}/n}$$

Reasons for falling short of the ideal include, e.g., start-up and synchronization overhead, sub-optimal load balancing, ...
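To get a feeling for how quickly the bound in Amdahl's Law flattens out, it can be evaluated for a few processor counts. The following is a minimal sketch; the function name `amdahl_speedup` and the 5% sequential fraction are assumptions chosen for illustration, not values from the slides.

```python
def amdahl_speedup(seq_fraction: float, n: int) -> float:
    """Upper bound on the speed-up with n processors according to Amdahl's Law.

    seq_fraction is the sequential part of the work (between 0 and 1);
    the parallel part is assumed to be the remainder, 1 - seq_fraction.
    """
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    par_fraction = 1.0 - seq_fraction
    return 1.0 / (seq_fraction + par_fraction / n)


# Hypothetical workload: 5% of the work cannot be parallelized.
for n in (1, 2, 4, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

With a sequential part of 5%, the bound can never reach 1/0.05 = 20, no matter how many processors are added.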

10.3 Opportunities for Parallelization in RDBMSs

Relational DBMSs offer a large potential to exploit parallelism:

- Data parallelism. Queries operate on large data sets. Data sets can be partitioned and each partition can be handled by a separate, parallel thread. Challenge: avoid skew in partitioning the data.
- Pipelined parallelism. Queries consist of (pipelined) sequences of operators. Each operator can be executed by a separate, pipelined thread. See our earlier discussion on pipelining.
- Operator parallelism. For many operators, their internal execution algorithms can be parallelized into several threads. For example, parallel join algorithms.
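The skew challenge mentioned under data parallelism can be made concrete with a small experiment. This is only a sketch under stated assumptions: keys are non-negative integers, the partitioning function is a plain modulo, and the helper names `hash_partition` and `skew` are hypothetical.

```python
def hash_partition(keys, n_nodes):
    """Assign each integer key to node (key mod n_nodes); purely illustrative."""
    parts = [[] for _ in range(n_nodes)]
    for k in keys:
        parts[k % n_nodes].append(k)
    return parts


def skew(parts):
    """Largest partition size divided by the average size (1.0 = perfectly balanced)."""
    sizes = [len(p) for p in parts]
    return max(sizes) / (sum(sizes) / len(sizes))


# Uniformly distributed keys: every node receives the same share.
uniform = hash_partition(range(1000), 4)

# Hypothetical skewed data: one key value accounts for 70% of the tuples.
skewed = hash_partition([0] * 700 + list(range(300)), 4)

print("uniform skew:", skew(uniform))
print("skewed skew: ", skew(skewed))
```

Since the slowest node determines the response time, a skew factor well above 1.0 means that most nodes sit idle while one node finishes its oversized partition.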

Different kinds of parallelism

Depending on what is performed in parallel, the following classification has been developed:

- Inter-transaction parallelism: several transactions are run in parallel. (This is the standard in all DBMSs.)
- Intra-transaction parallelism:
  - Inter-query parallelism: several queries within a transaction are run in parallel (needs an asynchronous SQL interface).
  - Intra-query parallelism: within one SQL call, multiple tasks are run in parallel.
    - Inter-operator parallelism: operators constituting a query are run in parallel.
    - Intra-operator parallelism: a single operator is implemented via a parallel algorithm.

[Figure: forms of parallelism (what is executed in parallel?); transactions delimited by BOT ... EOT containing Select, Select, Insert statements, illustrating inter-query parallelism and intra-query/intra-operator parallelism.]

10.4 Examples for parallel query execution plans

10.4.1 Parallel join algorithms

There are a number of parallel join algorithms. The simplest one is the parallel nested loops join (a.k.a. broadcast join):

1. Partitioning phase: broadcast the records of the outer relation to the nodes holding the inner relation.
2. Join phase: locally compute the (partial) joins on the nodes holding the inner relation.

[Figure: nodes labeled R and S.]

This algorithm can be used for non-equi joins, too.
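The two phases of the broadcast join can be sketched as a sequential simulation. The nodes are modeled as plain Python lists, there is no real network or threading involved, and the function name `broadcast_join` as well as the sample data are assumptions made for this example.

```python
def broadcast_join(outer, inner_partitions, predicate):
    """Simulate a parallel nested loops (broadcast) join.

    outer            -- list of outer records (R)
    inner_partitions -- one list of inner records (S) per node
    predicate        -- arbitrary join condition, so non-equi joins work too
    """
    results = []
    for inner_part in inner_partitions:
        # Phase 1 (partitioning): every node receives a full copy of the outer.
        received_outer = list(outer)
        # Phase 2 (join): plain nested loops against the local inner partition.
        for r in received_outer:
            for s in inner_part:
                if predicate(r, s):
                    results.append((r, s))
    return results


# Hypothetical data: a non-equi join R.a < S.b with S spread over two nodes.
R = [1, 5, 9]
S_nodes = [[2, 4], [6, 8]]
pairs = broadcast_join(R, S_nodes, lambda r, s: r < s)
print(sorted(pairs))
```

Because each node sees the complete outer relation, the placement of the inner tuples does not matter for correctness; the price is that the outer is shipped once per node.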

Parallel associative join

If the inner relation (S) is stored in partitions (according to the join attributes) and the join is an equi-join, then we can

1. distribute the outer tuples to the matching partition of the inner relation,
2. compute the (partial) joins locally on the nodes storing the inner partitions.

[Figure: nodes labeled R and S.]

Parallel (simple) hash join

1. Partition the outer relation (R) using some hash function h; send the records to the join node indicated by the hash value.
2. Partition the inner relation (S) using the same hash function h; send the records to the join node indicated by the hash value.
3. Locally compute the (partial) joins on all join nodes.

[Figure: nodes labeled R and S.]

N.B.: a node can be a scan node and a join node at the same time.
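The three steps of the simple hash join can be sketched as follows. This is a single-process simulation under stated assumptions: join keys are integers, h is a modulo on the key, and the helper names (`repartition`, `local_hash_join`, `parallel_hash_join`) are hypothetical.

```python
from collections import defaultdict


def repartition(fragments, key, n_join_nodes):
    """Steps 1 and 2: route every record from every scan node to join node h(key)."""
    buckets = [[] for _ in range(n_join_nodes)]
    for fragment in fragments:
        for rec in fragment:
            buckets[key(rec) % n_join_nodes].append(rec)  # h = key mod n (assumption)
    return buckets


def local_hash_join(r_part, s_part, r_key, s_key):
    """Step 3 on one join node: build a hash table on R, probe it with S."""
    table = defaultdict(list)
    for r in r_part:
        table[r_key(r)].append(r)
    return [(r, s) for s in s_part for r in table.get(s_key(s), [])]


def parallel_hash_join(r_fragments, s_fragments, r_key, s_key, n_join_nodes):
    r_parts = repartition(r_fragments, r_key, n_join_nodes)
    s_parts = repartition(s_fragments, s_key, n_join_nodes)
    out = []
    for r_part, s_part in zip(r_parts, s_parts):  # each pair is one join node
        out.extend(local_hash_join(r_part, s_part, r_key, s_key))
    return out


# Hypothetical relations, each stored on two scan nodes; the join key is field 0.
R_frags = [[(1, "a"), (2, "b")], [(3, "c")]]
S_frags = [[(2, "x")], [(3, "y"), (3, "z"), (4, "w")]]
first = lambda rec: rec[0]
joined = parallel_hash_join(R_frags, S_frags, first, first, n_join_nodes=2)
print(sorted(joined))
```

Using the same function h for both relations is what guarantees that matching tuples meet on the same join node, so no cross-node comparison is needed in step 3.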

Parallel asymmetric hash join

(See the earlier discussion of hash joins.)

1. Building phase: scan and distribute the outer relation according to some hash function h.
2. Probing phase: combine the scan/distribute of the inner relation with locally computing the join.

[Figure: nodes labeled R and S.]

N.B.: again, a node can be a scan node and a join node at the same time.

Parallel hybrid hash join

(See the earlier discussion of hybrid hash joins.)

1. Building phase: scan and distribute the outer relation according to some hash function h; keep the first bucket in memory.
2. Probing phase: combine the scan/distribute of the inner relation with locally computing the join for the first bucket.

[Figure: nodes labeled R and S.]

10.4.2 Parallelizing join trees

Left-deep join trees can be fully pipelined. As such, they offer good potential for (pipelining) inter-operator parallelism. When we consider parallel hash joins, though, we observe that each building phase falls within its own, sequential execution phase.

Example: consider the join of four relations R1, R2, R3, R4 and the left-deep join tree shown below; each join is implemented as an asymmetric hash join.

[Figure: left-deep join tree with joins J1, J2, J3 over scans S1 to S4, implemented by build/probe pairs (B1, P1), (B2, P2), (B3, P3). Legend: S = Scan, J = Join, B = Build, P = Probe.]

The execution of the query proceeds in 4 sequential steps (the tasks within each step are executed in parallel):

1. {S1, B1}
2. {S2, P1, B2}
3. {S3, P2, B3}
4. {S4, P3}
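The step pattern of the left-deep example generalizes to any number of relations. The following sketch derives the steps for k relations; the function names are hypothetical, and the task labels (S, B, P) follow the legend of the slide.

```python
def left_deep_schedule(k):
    """Sequential steps for a left-deep tree over k relations (k >= 2),
    where every join is an asymmetric hash join."""
    if k < 2:
        raise ValueError("need at least two relations")
    steps = [["S1", "B1"]]  # scan R1 and build the first hash table
    for i in range(2, k):
        # Scan Ri, probe the previous hash table, build the next one from the result.
        steps.append([f"S{i}", f"P{i - 1}", f"B{i}"])
    steps.append([f"S{k}", f"P{k - 1}"])  # the last scan only probes
    return steps


def hash_tables_in_use(step):
    """Hash tables touched in a step: those being built plus those being probed."""
    return sum(1 for task in step if task[0] in "BP")


schedule = left_deep_schedule(4)
for number, step in enumerate(schedule, start=1):
    print(number, step, "hash tables in use:", hash_tables_in_use(step))
```

The number of sequential steps grows linearly with the number of relations, while the number of hash tables in use per step stays at two or fewer.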

Analysis of parallelizing left-deep join trees

PROs:
- no more than 2 hash tables have to be kept in memory at the same time
- the probing relation is always a base table

CONs:
- rather limited degree of parallelism
- the size of the hash tables (build phase) depends on join selectivity (difficult to estimate accurately)

Parallelizing right-deep join trees

Example: consider the same join of four relations R1, R2, R3, R4 as before, but now look at the right-deep join tree shown below; each join is again implemented as an asymmetric hash join.

[Figure: right-deep join tree with joins J1, J2, J3 over scans S1 to S4, implemented by build/probe pairs (B1, P1), (B2, P2), (B3, P3).]

Now, the execution of the query can be split into only 2 sequential steps (parallelizing the tasks within each step):

1. {S2, B1, S3, B2, S4, B3}
2. {S1, P1, P2, P3}

PROs:
- more parallelism (parallel scans, all probing phases in a single pipeline)
- all build relations are base tables, hence better size estimates

CONs:
- much higher memory requirements (all build tables)
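The trade-off of the right-deep tree (few steps, many resident hash tables) can be illustrated with a short sketch. The function names and the relation sizes in pages are hypothetical values chosen for this example only.

```python
def right_deep_schedule(k):
    """The two sequential steps for a right-deep tree over k relations (k >= 2),
    where every join is an asymmetric hash join."""
    if k < 2:
        raise ValueError("need at least two relations")
    build_step = []
    for i in range(2, k + 1):
        build_step += [f"S{i}", f"B{i - 1}"]  # scan Ri and build hash table i-1
    probe_step = ["S1"] + [f"P{i}" for i in range(1, k)]  # one probing pipeline
    return [build_step, probe_step]


def peak_build_memory(build_sizes):
    """All hash tables must be resident during the probe step, so their sizes add up."""
    return sum(build_sizes)


steps = right_deep_schedule(4)
print(steps)

# Hypothetical sizes (in pages) of the build relations R2, R3, R4.
sizes = {"R2": 100, "R3": 250, "R4": 50}
print("pages needed during probing:", peak_build_memory(sizes.values()))
```

Since every build input is a base table, the sizes fed into `peak_build_memory` could be taken from catalog statistics rather than from selectivity estimates.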
