Course Content. Parallel & Distributed Databases. Objectives of Lecture 12 Parallel and Distributed Databases

Similar documents
Distributed Databases

Distributed Databases

Distributed Databases

PARALLEL & DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Parallel DBMS. Chapter 22, Part A

DISTRIBUTED DATABASES CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

COURSE 12. Parallel DBMS

CompSci 516: Database Systems. Lecture 20. Parallel DBMS. Instructor: Sudeepa Roy

It also performs many parallelization operations like, data loading and query processing.

Chapter 18: Parallel Databases

Distributed KIDS Labs 1

Chapter 18: Parallel Databases Chapter 19: Distributed Databases ETC.

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/15/18

Outline. Parallel Database Systems. Information explosion. Parallelism in DBMSs. Relational DBMS parallelism. Relational DBMSs.

Advanced Databases: Parallel Databases A.Poulovassilis

Database Architectures

Chapter 19: Distributed Databases

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Database Management Systems

CSE 444: Database Internals. Lecture 22 Distributed Query Processing and Optimization

Database Architectures

Distributed Databases. Distributed Database Systems. CISC437/637, Lecture #21 Ben Cartere?e

Parallel DBMS. Lecture 20. Reading Material. Instructor: Sudeepa Roy. Reading Material. Parallel vs. Distributed DBMS. Parallel DBMS 11/7/17

R & G Chapter 13. Implementation of single Relational Operations Choices depend on indexes, memory, stats, Joins Blocked nested loops:

MaanavaN.Com DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING QUESTION BANK

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases. Introduction

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Chapter 17: Parallel Databases

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Distributed DBMS. Concepts. Concepts. Distributed DBMS. Concepts. Concepts 9/8/2014

Huge market -- essentially all high performance databases work this way

Introduction to Database Systems CSE 444. Lecture 26: Distributed Transactions

User Perspective. Module III: System Perspective. Module III: Topics Covered. Module III Overview of Storage Structures, QP, and TM

Chapter 19: Distributed Databases

Chapter 11 - Data Replication Middleware

Architecture and Implementation of Database Systems (Winter 2014/15)

Administrivia. CS186 Class Wrap-Up. News. News (cont) Top Decision Support DBs. Lessons? (from the survey and this course)

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

High-Performance Parallel Database Processing and Grid Databases

Database Architectures

Distributed Query Processing. Banking Example

Database Applications (15-415)

Database Applications (15-415)

Evaluation of Relational Operations: Other Techniques. Chapter 14 Sayyed Nezhadi

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe Slide 25-1

Chapter 20: Database System Architectures

Query Processing and Query Optimization. Prof Monika Shah

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Principles of Data Management. Lecture #9 (Query Processing Overview)

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 14 Distributed Transactions

Scaling Database Systems. COSC 404 Database System Implementation Scaling Databases Distribution, Parallelism, Virtualization

Database Management Systems (CS 601) Assignments

Examples of Physical Query Plan Alternatives. Selected Material from Chapters 12, 14 and 15

Query Evaluation! References:! q [RG-3ed] Chapter 12, 13, 14, 15! q [SKS-6ed] Chapter 12, 13!

GFS: The Google File System. Dr. Yingwu Zhu

Lecture 23 Database System Architectures

Database Applications (15-415)

CS352 Lecture: Database System Architectures last revised 11/22/06

CMSC 461 Final Exam Study Guide

CSE 544: Principles of Database Systems

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Distributed Systems. Characteristics of Distributed Systems. Lecture Notes 1 Basic Concepts. Operating Systems. Anand Tripathi

Distributed Systems. Characteristics of Distributed Systems. Characteristics of Distributed Systems. Goals in Distributed System Designs

Chapter 3. Database Architecture and the Web

Datenbanksysteme II: Caching and File Structures. Ulf Leser

Chapter Outline. Chapter 2 Distributed Information Systems Architecture. Distributed transactions (quick refresh) Layers of an information system

Implementation of Relational Operations

Overview of Implementing Relational Operators and Query Evaluation

Evaluation of relational operations

CPSC 421 Database Management Systems. Lecture 11: Storage and File Organization

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

CSE 544 Principles of Database Management Systems

Database Applications (15-415)

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

Database Applications (15-415)

External Sorting Implementing Relational Operators

Transaction Management: Concurrency Control

CPSC 421 Database Management Systems. Lecture 19: Physical Database Design Concurrency Control and Recovery

Course Content. Object-Oriented Databases. Objectives of Lecture 6. CMPUT 391: Object Oriented Databases. Dr. Osmar R. Zaïane. University of Alberta 4

Database Technology Database Architectures. Heiko Paulheim

Implementing Joins 1

L9: Storage Manager Physical Data Organization

CSE 444: Database Internals. Section 9: 2-Phase Commit and Replication

Evaluation of Relational Operations

Distributed File Systems II

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Cost-based Query Sub-System. Carnegie Mellon Univ. Dept. of Computer Science /615 - DB Applications. Last Class.

Database Applications (15-415)

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Database Management Systems need to:

Storing Data: Disks and Files. Storing and Retrieving Data. Why Not Store Everything in Main Memory? Chapter 7

CS3600 SYSTEMS AND NETWORKS

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

management systems Elena Baralis, Silvia Chiusano Politecnico di Torino Pag. 1 Distributed architectures Distributed Database Management Systems

Chapter 2 Distributed Information Systems Architecture

Storing and Retrieving Data. Storing Data: Disks and Files. Solution 1: Techniques for making disks faster. Disks. Why Not Store Everything in Tapes?

Overview of Query Processing. Evaluation of Relational Operations. Why Sort? Outline. Two-Way External Merge Sort. 2-Way Sort: Requires 3 Buffer Pages

Transcription:

Database Management Systems Winter 2003 CMPUT 391: Dr. Osmar R. Zaïane University of Alberta Chapter 22 of Textbook Course Content Introduction Database Design Theory Query Processing and Optimisation Concurrency Control Data Base Recovery and Security Object-Oriented Databases Inverted Index for IR Spatial Data Management XML and Databases Data Warehousing Data Mining Parallel and Distributed Databases University of Alberta 1 University of Alberta 2 Objectives of Lecture 12 Parallel and Distributed Databases Get a general idea about what parallel and Distributed databases are Get an overview of what can be parallelized in DMBS (Query, Operations,Updating) Get acquainted with the existing architectures for parallel databases University of Alberta 3 University of Alberta 4

Why Parallel Access To Data? 1 Terabyte 10 MB/s Bandwidth 1 Terabyte Parallelism: divide a big problem into many smaller ones to be solved in parallel. University of Alberta 5 Why Parallel Access To Data? Motivations : ÿperformance ÿincreased Availability ÿdistributed Access to Data ÿanalysis of Distributed data University of Alberta 6 Why Parallel Access To Data? What is a Parallel Database System? Is the one that seeks to improve performance through parallel implementations of various operations such as : Loading data, Building indexes & evaluations of queries. Where the data are stored either in distributed fashion or centralized. Architecture of Parallel Databases Shared Memory (SMP) CLIENTS Processors Memory Easy to program Expensive to build Difficult to scaleup Shared Disk CLIENTS Shared Nothing (network) CLIENTS Hard to program Cheap to build Easy to scaleup University of Alberta 7 University of Alberta 8

Architecture of Parallel Databases Speed-Up More resources means proportionally less time for given amount of data. Scale-Up If resources increased in proportion to increase in data size, time is constant. Transaction/sec. (throughput) sec./transaction (response time) degree of -ism Ideal degree of -ism University of Alberta 9 University of Alberta 10 Parallel Query Evaluation Parallelism of Queries can be done using : Pipeline parallelism: many machines each doing one step in a multi-step process. Partition parallelism: many machines doing the same thing to different pieces of data. Pipeline Partition Any Sequential program Program Sequential Sequential Any Sequential Program Any Sequential Program program Any Sequential Program outputs split N ways, inputs merge M ways University of Alberta 11 Parallel Query Evaluation Partitioning a table: Range Hash Round Robin A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z A...E F...J K...N O...S T...Z Good for equi-joins, Good for equi-joins Good to spread load range queries group-by Reduce Data Skew Shared disk and memory less sensitive to partitioning, Shared nothing benefits from "good" partitioning University of Alberta 12

Parallelizing individual operations Parallel Query Optimization Bulk Loading and Scanning : Pages can be read in parallel while scanning the relations Sorting : Each Processor sorts its local portion Joins : Join the sub results into the final one (many ways) Issues to be considered Cost : Optimizer should estimate operation costs. Speed : The fastest answers may not be the cheapest Dataflow Network for parallel Join University of Alberta 13 University of Alberta 14 Distributed Databases Data is stored at several sites, each managed by a DBMS that can run independently. Distributed Data Independence: Users should not have to know where data is located (extends Physical and Logical Data Independence principles). Distributed Transaction Atomicity: Users should be able to write transactions accessing multiple sites just like local transactions. University of Alberta 15 University of Alberta 16

Distributed Databases ÿusers have to be aware of where data is located, i.e., Distributed Data Independence and Distributed Transaction Atomicity are not supported. ÿthese properties are hard to support efficiently. ÿfor globally distributed sites, these properties may not even be desirable due to administrative overheads of making location of data transparent. Types Distributed Databases Homogeneous: Every site runs same type of DBMS. Heterogeneous: Different sites run different DBMSs (different RDBMSs or even nonrelational DBMSs). DBMS1 DBMS2 DBMS3 Gateway University of Alberta 17 University of Alberta 18 Distributed DBMS Architectures Distributed DBMS Architectures Client-Server QUERY Collaborating-Server Client ships query to single site. All query processing at server. - Thin vs. fat clients. - Set-oriented communication, client side caching. CLIENT CLIENT Query can span multiple sites. QUERY CLIENT University of Alberta 19 University of Alberta 20

Distributed DBMS Architectures Middleware System One Server manages queries and transactions spans Middleware multiple servers QUERY CLIENT University of Alberta 21 University of Alberta 22 Storing Data in a Distributed DBMS Storing Data in a Distributed DBMS Fragmentation TID t1 t2 t3 t4 Horizontal: Usually disjoint. Replication Gives increased availability. Faster query evaluation. Synchronous vs. Asynchronous. Vary in how current copies are. R1 R1 R3 R2 SITE A SITE B Vertical: Lossless-join; tids. University of Alberta 23 University of Alberta 24

Distributed Catalog Management Must keep track of how data is distributed across sites. Must be able to name each replica of each fragment. To preserve local autonomy: <local-name, birth-site> Site Catalog: Describes all objects (fragments, replicas) at a site + Keeps track of replicas of relations created at this site. To find a relation, look up its birth-site catalog. Birth-site never changes, even if relation is moved. University of Alberta 25 University of Alberta 26 Distributed Queries Processing Distributed Queries Processing Horizontally Fragmented: Tuples with rating < 5 at Shanghai, >= 5 at Tokyo. Must compute SUM(age), COUNT(age) at both sites. If WHERE contained just S.rating>6, just one site. SELECT AVG(S.age) FROM Sailors S WHERE S.rating > 3 AND S.rating < 7 Vertically Fragmented: sid and rating at Shanghai, sname and age at Tokyo, tid at both. SELECT AVG(S.age) FROM Sailors S WHERE S.rating > 3 AND S.rating < 7 -Must reconstruct relation by join on tid, then evaluate the query. University of Alberta 27 University of Alberta 28

Distributed Queries Processing Replicated: Sailors copies at both sites. SELECT AVG(S.age) FROM Sailors S WHERE S.rating > 3 AND S.rating < 7 Choice of site based on local costs, shipping costs. Distributed Queries Processing Distributed Joins LONDON PARIS Fetch as Needed, Sailors Reserves Page NL, Sailors as outer: 500 pages 1000 pages Cost: 500 D + 500 * 1000 (D+S) D is cost to read/write page; S is cost to ship page. If query was not submitted at London, must add cost of shipping result to query site. Can also do INL at London, fetching matching Reserves tuples to London as needed. University of Alberta 29 University of Alberta 30 Distributed Queries Processing Distributed Joins, Cont LONDON PARIS Ship to One Site: Sailors Reserves 500 pages 1000 pages Ship Reserves to London. Cost: 1000 S + 4500 D (SM Join; cost = 3*(500+1000)) If result size is very large, may be better to ship both relations to result site and then join them! University of Alberta 31 University of Alberta 32

Updating Distributed Data Synchronous Replication: All copies of a modified relation (fragment) must be updated before the modifying Xact commits. Data distribution is made transparent to users. Asynchronous Replication: Copies of a modified relation are only periodically updated; different copies may get out of synch in the meantime. Users must be aware of data distribution. Current products follow this approach. University of Alberta 33 Updating Distributed Data Synchronous Replication Voting: Transaction must write a majority of copies to modify an object; must read enough copies to be sure of seeing at least one most recent copy. E.g., 10 copies; 7 written for update; 4 copies read. Each copy has version number. Not attractive usually because reads are common. Read-any Write-all: Writes are slower and reads are faster, relative to Voting. Most common approach to synchronous replication. Choice of technique determines which locks to set. University of Alberta 34 Updating Distributed Data Asynchronous Replication Allows modifying transaction to commit before all copies have been changed (and readers nonetheless look at just one copy). Users must be aware of which copy they are reading, and that copies may be out-of-sync for short periods of time. Two approaches: Primary Site and Peer-to-Peer replication. Difference lies in how many copies are ``updatable or ``master copies. University of Alberta 35 University of Alberta 36

Distributed Transactions Transactions are divides in each site as sub transactions. Coordination between these sub transactions need : Distributed Concurrency Control Distributed Recovery Distributed Transactions Distributed Concurrency Control How do we manage locks for objects across many sites? Centralized: One site does all locking. Vulnerable to single site failure. Primary Copy: All locking for an object done at the primary copy site for this object. Reading requires access to locking site as well as site where the object is stored. Fully Distributed: Locking for a copy done at site where the copy is stored. Locks at all sites while writing an object. University of Alberta 37 University of Alberta 38 Distributed Transactions Distributed Deadlock Detection Each site maintains a local waits-for graph. A global deadlock might exist even if the local graphs contain no cycles: T1 T2 T1 T2 T1 T2 SITE A SITE B GLOBAL Three solutions: Centralized (send all local graphs to one site); Hierarchical (organize sites into a hierarchy and send local graphs to parent in the hierarchy); Timeout (abort transaction if it waits too long). Distributed Transactions Distributed Recovery Two new issues: New kinds of failure, e.g., links and remote sites. If sub-transactions of a transaction execute at different sites, all or none must commit. Need a commit protocol to achieve this. A log is maintained at each site, as in a centralized DBMS, and commit protocol actions are additionally logged. University of Alberta 39 University of Alberta 40