Chapter 18: Parallel Databases
Practice Exercises

18.1 In a range selection on a range-partitioned attribute, it is possible that only one disk may need to be accessed. Describe the benefits and drawbacks of this property.

Answer: If there are few tuples in the queried range, then each query can be processed quickly on a single disk. This allows parallel execution of queries with reduced overhead of initiating queries on multiple disks. On the other hand, if there are many tuples in the queried range, each query takes a long time to execute, as there is no parallelism within its execution. Also, some of the disks can become hot spots, further increasing response time. Hybrid range partitioning, in which small ranges (a few blocks each) are partitioned in a round-robin fashion, provides the benefits of range partitioning without its drawbacks.
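The hybrid scheme can be illustrated with a small sketch. The following Python fragment is not from the text; the boundary values and disk count are made-up assumptions. It maps each value to a small range and assigns those ranges to disks round-robin, so a query over a narrow range touches a single disk while a query over a wide range is spread across all disks.

import bisect

# Hypothetical small-range boundaries; range i covers values up to boundaries[i].
boundaries = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
num_disks = 4

def disk_for_value(v):
    # Hybrid range partitioning: locate the small range containing v,
    # then assign that range to a disk in round-robin fashion.
    range_id = bisect.bisect_left(boundaries, v)
    return range_id % num_disks

# A narrow range query (42..44) falls inside one small range -> one disk.
print(sorted({disk_for_value(v) for v in range(42, 45)}))   # [0]
# A wide range query (5..95) spans many small ranges -> all four disks share the work.
print(sorted({disk_for_value(v) for v in range(5, 96)}))    # [0, 1, 2, 3]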

18.2 What form of parallelism (interquery, interoperation, or intraoperation) is likely to be the most important for each of the following tasks?

a. Increasing the throughput of a system with many small queries

b. Increasing the throughput of a system with a few large queries, when the number of disks and processors is large

Answer:

a. When there are many small queries, interquery parallelism gives good throughput. Parallelizing each of these small queries would increase the initiation overhead, without any significant reduction in response time.

b. With a few large queries, intraquery parallelism is essential to get fast response times. Given that there is a large number of processors and disks, only intraoperation parallelism can take advantage of the parallel hardware, since such queries typically have few operations, but each one needs to process a large number of tuples.

18.3 With pipelined parallelism, it is often a good idea to perform several operations in a pipeline on a single processor, even when many processors are available.

a. Explain why.

b. Would the arguments you advanced in part a hold if the machine has a shared-memory architecture? Explain why or why not.

c. Would the arguments in part a hold with independent parallelism? (That is, are there cases where, even if the operations are not pipelined and there are many processors available, it is still a good idea to perform several operations on the same processor?)

Answer:

a. The speedup obtained by parallelizing the operations would be offset by the data-transfer overhead, as each tuple produced by an operator would have to be transferred to its consumer, which is running on a different processor.

b. In a shared-memory architecture, transferring the tuples is very efficient, so the above argument does not hold to any significant degree.

c. Even if two operations are independent, it may be that they both supply their outputs to a common third operator. In that case, running all three on the same processor may be better than transferring tuples across processors.

18.4 Consider join processing using symmetric fragment and replicate with range partitioning. How can you optimize the evaluation if the join condition is of the form |r.A − s.B| ≤ k, where k is a small constant? Here, |x| denotes the absolute value of x. A join with such a join condition is called a band join.

Answer: Relation r is partitioned into n partitions, r_0, r_1, ..., r_(n-1), and s is also partitioned into n partitions, s_0, s_1, ..., s_(n-1). The partitions are replicated and assigned to processors as shown below.

          s_0      s_1      s_2      s_3      ...      s_(n-1)
r_0       P_0,0    P_0,1
r_1       P_1,0    P_1,1    P_1,2
r_2                P_2,1    P_2,2    P_2,3
...                                           ...
r_(n-1)                                                P_(n-1),(n-1)

Each fragment is replicated on at most three processors, unlike in the general case, where it is replicated on n processors. The number of processors required is now approximately 3n, instead of n^2 in the general case. Therefore, given the same number of processors, we can partition the relations into more fragments with this optimization, thus making each local join faster.
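The assignment above can be made concrete with a short sketch. The following Python fragment is an illustration rather than anything from the text; it assumes the range-partitioning vectors of r and s are aligned and that each range is wider than the band width k, so tuples of r_i can only join with tuples in s_(i-1), s_i, or s_(i+1). The fragment count is made up.

n = 6  # number of partitions of r and of s (illustrative value)

# For the band join |r.A - s.B| <= k with aligned range partitions,
# only processors P(i, j) with |i - j| <= 1 need to be allocated,
# since r_i can match only s_(i-1), s_i, and s_(i+1).
processors = [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= 1]

for i, j in processors:
    print(f"P({i},{j}) joins fragment r_{i} with fragment s_{j}")

# Each fragment is sent to at most three processors, so about 3n processors
# suffice instead of the n*n needed by full fragment-and-replicate.
print(f"processors used: {len(processors)} (vs. {n * n} with full replication)")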

18.5 Recall that histograms are used for constructing load-balanced range partitions.

a. Suppose you have a histogram where values are between 1 and 100 and are partitioned into 10 ranges, 1-10, 11-20, ..., 91-100, with frequencies 15, 5, 20, 10, 10, 5, 5, 20, 5, and 5, respectively. Give a load-balanced range-partitioning function to divide the values into 5 partitions.

b. Write an algorithm for computing a balanced range partition with p partitions, given a histogram of frequency distributions containing n ranges.

Answer:

a. A partitioning vector that gives 5 partitions with 20 tuples in each partition is [21, 31, 51, 76]. The 5 partitions obtained are 1-20, 21-30, 31-50, 51-75, and 76-100. The assumption made in arriving at this partitioning vector is that, within a histogram range, each value is equally likely.

b. Let the histogram ranges be called h_1, h_2, ..., h_h, and the partitions p_1, p_2, ..., p_p. Let the frequencies of the histogram ranges be n_1, n_2, ..., n_h. Each partition should contain N/p tuples, where N = n_1 + n_2 + ... + n_h. To construct the load-balanced partitioning vector, we need to determine the value of the k_1-th tuple, the value of the k_2-th tuple, and so on, where k_1 = N/p, k_2 = 2N/p, etc., up to k_(p-1). The partitioning vector then consists of these p - 1 values. The value of the k_i-th tuple is determined as follows. First determine the histogram range h_j in which it falls. Assuming all values in a range are equally likely, the k_i-th value will be

    s_j + (e_j - s_j) * (k_i^j / n_j)

where
    s_j is the first value in h_j,
    e_j is the last value in h_j, and
    k_i^j = k_i - (n_1 + n_2 + ... + n_(j-1)).
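A direct transcription of this procedure into code may make the interpolation step easier to follow. The sketch below is Python written for this answer (not taken from the text); it reuses the histogram of part (a) as test data and assumes, as above, that values are uniformly distributed within each histogram range.

def balanced_partition_vector(ranges, freqs, p):
    # ranges: list of (first value, last value) for each histogram range
    # freqs:  number of tuples in each histogram range
    # p:      desired number of partitions
    # Returns the p-1 cut values of a load-balanced range-partitioning vector.
    N = sum(freqs)
    cuts = []
    seen = 0   # tuples in histogram ranges entirely to the left of range j
    j = 0      # index of the histogram range currently being examined
    for i in range(1, p):
        k_i = i * N / p                  # position of the k_i-th tuple
        while seen + freqs[j] < k_i:     # advance to the range containing it
            seen += freqs[j]
            j += 1
        s_j, e_j = ranges[j]
        k_i_j = k_i - seen               # offset of that tuple within range h_j
        cuts.append(s_j + (e_j - s_j) * k_i_j / freqs[j])
    return cuts

# Histogram from part (a): ranges 1-10, 11-20, ..., 91-100.
ranges = [(10 * r + 1, 10 * r + 10) for r in range(10)]
freqs = [15, 5, 20, 10, 10, 5, 5, 20, 5, 5]
print(balanced_partition_vector(ranges, freqs, 5))
# Prints [20.0, 30.0, 50.0, 75.5]; taking the next integer above each cut
# gives the vector [21, 31, 51, 76] from part (a).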

18.6 Large-scale parallel database systems store an extra copy of each data item on disks attached to a different processor, to avoid loss of data if one of the processors fails.

a. Instead of keeping the extra copy of data items from a processor at a single backup processor, it is a good idea to partition the copies of the data items of a processor across multiple processors. Explain why.

b. Explain how virtual-processor partitioning can be used to efficiently implement the partitioning of the copies as described above.

c. What are the benefits and drawbacks of using RAID storage instead of storing an extra copy of each data item?

Answer: FILL

18.7 Suppose we wish to index a large relation that is partitioned. Can the idea of partitioning (including virtual-processor partitioning) be applied to indices? Explain your answer, considering the following two cases (assuming for simplicity that partitioning as well as indexing are on single attributes):

a. Where the index is on the partitioning attribute of the relation

b. Where the index is on an attribute other than the partitioning attribute of the relation

Answer: FILL

18.8 Suppose a well-balanced range-partitioning vector had been chosen for a relation, but the relation is subsequently updated, making the partitioning unbalanced. Even if virtual-processor partitioning is used, a particular virtual processor may end up with a very large number of tuples after the update, and repartitioning would then be required.

a. Suppose a virtual processor has a significant excess of tuples (say, twice the average). Explain how repartitioning can be done by splitting the partition, thereby increasing the number of virtual processors.

b. Suppose that, instead of round-robin allocation of virtual processors, virtual partitions can be allocated to processors in an arbitrary fashion, with a mapping table tracking the allocation. If a particular node has excess load (compared to the others), explain how the load can be balanced.

c. Assuming there are no updates, does query processing have to be stopped while repartitioning, or reallocation of virtual processors, is carried out? Explain your answer.

Answer: FILL