
Parallel Database Systems
Stavros Harizopoulos, stavros@cs.cmu.edu

Outline
- Background
- Hardware architectures and performance metrics
- Parallel database techniques
- Gamma
- Bonus: NCR / Teradata
- Conclusions

Parallelism in DBMSs
- Demand for higher throughput
- Opportunity in the relational data model
- Hardware trends and paradigms
- Enabling technologies

Information explosion
- Magnetic storage is cheaper than paper, so all information goes online
- We might record everything we read (10 MB/day), hear (400 MB/day), and see (40 GB/day)
- Data storage, organization, and analysis is a challenge

Relational DBMSs
- The relational data model was universally adopted

Relational DBMS parallelism
- Relational queries are ideal for parallel execution: uniform operators apply to uniform data streams, consuming 1-2 relations and producing a new relation
- Pipelined parallelism: e.g., a sequential Scan feeding a sequential Sort
- Partitioned parallelism: split N ways / merge M ways
- The dataflow approach requires message-based systems and a high-speed interconnect
(slide figure: sequential Scan and Sort operators composed into pipelined and partitioned plans)
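The pipelined style described above can be sketched with Python generators: each operator consumes a stream of tuples and produces a new one, so downstream operators start before upstream ones finish. The operator names and the sample relation are illustrative, not from the talk.

```python
# A minimal sketch of pipelined operator execution, assuming a relation is an
# iterator of tuples. The names scan, select_op, and project are hypothetical.

def scan(relation):
    """Producer: stream tuples one at a time (no full materialization)."""
    for row in relation:
        yield row

def select_op(rows, predicate):
    """Consume one stream, produce a filtered stream."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Consume one stream, produce a stream of narrower tuples."""
    for row in rows:
        yield tuple(row[c] for c in columns)

# Build the pipeline; tuples flow through all three operators incrementally.
emp = [("alice", 30), ("bob", 25), ("carol", 41)]
pipeline = project(select_op(scan(emp), lambda r: r[1] > 28), columns=[0])
print(list(pipeline))  # [('alice',), ('carol',)]
```

A blocking operator such as Sort would have to consume its whole input before emitting anything, which is exactly why the deck notes that pipelines are inherently short.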

Trends and paradigms
- Mainframes are increasingly expensive
- Economy of scale: off-the-shelf components
- Trends in networks, storage, memory, and CPUs: bottlenecks shift, new issues arise

Enabling technologies
- Client-server / networking software

An idea whose time has passed?
- Database machines: research in 1975-1985
- Exotic technologies led to failures: bubble memory, head-per-track disks, extra logic added on disk heads (note the comeback with Active Disks)
- I/O was traditionally assumed to be the bottleneck
- Parallel DBMSs found their way

The market today
- Academia -- Wisconsin: from DIRECT to GAMMA; Berkeley: XPRS
- Commercial -- Teradata, started in 1984; others: Tandem, Oracle, Informix, Navigator, DB2, Redbrick

Shared-Memory
- CPUs share direct access to global RAM and disks
(slide figure: shared-memory multiprocessor -- processors P1..Pn on an interconnection network with global shared memory)

Shared-Disk
- CPUs still share disks but have private RAM
(slide figure: shared-disk multiprocessor -- processors P1..Pn with private memories, reaching the disks through an interconnection network)

Shared-Nothing
- Each CPU owns its RAM and disk(s)

Pros and cons
- Shared systems don't scale: resource contention, interconnect bandwidth must grow with the number of nodes, concurrency problems
- Fixes? Private caches, affinity scheduling

Pros and cons (cont'd)
- Shared-nothing scales better: minimal resource sharing/contention, low traffic on the interconnect, commodity components
- Which is easier to program? To build? To scale up?

Performance metrics
- Speedup: add nodes to run a fixed problem faster
- Scaleup: add nodes to run a bigger problem in the same time
- Transaction scaleup: more clients and servers, same response time

Barriers to linear throughput
- Startup
- Interference: even 1% increased contention limits speedup to 37
- Skew: at fine granularity, the variance can exceed the mean service time

Barriers to linear throughput (cont'd)
(slide figures: Speedup = OldTime / NewTime; a good speedup curve is linear in processors & discs; bad curves show no parallelism benefit, or fall away from linearity due to the three factors -- startup, interference, skew)

Dataflow approach to SQL
- Relational properties: uniform data streams; relations are created, updated, and queried via SQL (e.g., scan = select + project)
- SQL benefits: data independence; non-procedural, so queries can be executed as dataflow graphs

Achieving parallelism
- Data partitioning of relations
- Pipelining relational operators
- Partitioned execution of relational operators

Limits to pipelining
- Pipelines are inherently short
- Some operators are not pipelineable
- Skew limits speedup

Data partitioning
- I/O happens in parallel
- Three basic strategies: round-robin, hash, range
- Partitioning helps both sequential and associative scans
- Further partitioning helps up to a point
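The speedup formula and the interference barrier can be made concrete with a short calculation. The Amdahl-style model below (a fixed serial or contended fraction of the work) is a standard way to show how a little contention caps speedup; the specific times and fractions are illustrative assumptions, not measurements from the talk.

```python
# Speedup = OldTime / NewTime, as on the slide. amdahl_speedup models a fixed
# non-parallelizable fraction of the work (an illustrative assumption).

def speedup(old_time, new_time):
    """Classic speedup: elapsed time on one node over elapsed time on n nodes."""
    return old_time / new_time

def amdahl_speedup(n, serial_fraction):
    """Best possible speedup on n nodes when serial_fraction of the work
    cannot be parallelized (e.g., contention on a shared resource)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)

print(speedup(100.0, 25.0))        # 4.0 -- four nodes gave perfectly linear speedup
print(amdahl_speedup(1000, 0.01))  # ~91 -- 1% serial work wastes most of 1000 nodes
```

The second line illustrates the interference barrier: no matter how many nodes are added, speedup approaches 1/serial_fraction, so even a small contended fraction flattens the curve.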

Data partitioning strategies
- Round-robin partitioning: good for sequential scans / spreading the load
- Hash partitioning: good for equijoins
- Range partitioning (e.g., A...E, F...J, K...N, O...S, T...Z): good for range queries

Parallelizing relational operators
- Reuse existing implementations; shared-nothing helps here
- Only three mechanisms are needed: operator replication, a merge operator, a split operator
- The result is linear speedup and scaleup

Operators on partitioned data (1)
- Operator replication: linear scaleup minus the starting cost; specialized operators (e.g., hash join -- see GAMMA)
- Merge operator: combines many streams into one
- Split operator: maps attribute values to destination processes
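The three declustering strategies (and the split operator's value-to-destination mapping) can be sketched as simple routing functions. The helper names, node count, and range boundaries below are hypothetical.

```python
# A sketch of the three partitioning strategies, assuming n nodes and a
# partitioning attribute per tuple. All names here are illustrative.

def round_robin_node(tuple_index, n):
    """Round-robin: tuple i goes to node i mod n (spreads load, good for scans)."""
    return tuple_index % n

def hash_node(key, n):
    """Hash partitioning: equal keys colocate on one node (good for equijoins)."""
    return hash(key) % n

def range_node(key, boundaries):
    """Range partitioning: node i holds keys <= boundaries[i] (good for range
    queries, since a range predicate touches only the relevant nodes)."""
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)  # last partition holds everything above the top boundary

print(round_robin_node(7, 5))                 # 2
print(range_node("G", ["E", "J", "N", "S"]))  # 1 -- "G" falls in the F...J range
```

The same mapping drives a split operator: an operator's output tuple is sent to the process whose node number the function returns. (Note that Python's `hash` is randomized for strings across runs; a real system would use a fixed hash function.)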

Operators on partitioned data (2)
(slide figure)

Summary
- Exotic technologies yield to inexpensive hardware
- Shared-nothing serves parallel DBMSs better
- Potential of parallel database systems: many implementations (successful: Teradata)
- Research issues (at the end of the presentation)

History
- DIRECT (1977-84): early database machine project; showed parallelism is useful for DB apps
- Flaws that curtailed scalability: shared memory, central control of execution
- Gamma (1984-92)

Key ideas in Gamma
- Shared-nothing
- Hash-based parallel algorithms
- Horizontal partitioning

Gamma hardware (v1.0)
- 17 VAX 11/750 processors, 2 MB RAM per node
- 80 Mb/s token ring
- Separate VAX running Unix (host)
- 333 MB Fujitsu drives at 8 processors

Gamma hardware (v1.0) issues
- 2 KB DB pages due to the token ring
- Unibus congestion (network and I/O were faster); corrected with a backplane card
- VAX became obsolete
- 2 MB with no virtual memory was tight

Gamma hardware (v2.0), 1988
- iPSC/2 Intel hypercube: 32 x386 processors, 8 MB of memory each
- 330 MB Maxtor drive per node (45 KB cache)
- Routing modules: 2.8 Mb/s, full-duplex, serial, reliable

Gamma software (v2.0)
- OS: NOSE -- multiple lightweight processes with shared memory
- Entire DB in one NX/2 process
- Details: renaming nodes; 10% of CPU used for copying; excessive interrupts during I/O

Storage organization
- Horizontal partitioning (user-selectable): round-robin, hashed, range-partitioned
- Clustered index on different attributes
- Relations should be partitioned based on heat (access frequency)

Gamma process structure
(slide figure: process structure)
- Catalog manager: repository for the DB schema
- Query manager: one associated with each user
- Scheduler processes: coordinate multi-site queries
- Operator processes: execute a single relational operator

Query processing
- Ad-hoc and embedded query interfaces
- Standard parsing, optimization, and code generation
- Left-deep trees only; hash joins only
- At most two join operators active simultaneously
- Split tables

Split table
- Directs operator output to the appropriate node

Query processing: an example
(slide figures: example query plan)

Selections
- Start a selection operator on each node
- Nodes can be excluded for hash and range partitioning
- Throughput considerations: one-page read-ahead

Joins
- Partition into buckets, then join the buckets
- Implemented: sort-merge, Grace, Simple, Hybrid
- Parallel Hybrid
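The bucket idea behind Gamma's joins can be sketched as a Grace-style hash join: both inputs are first partitioned on the join key, then each matching bucket pair is joined independently (in Gamma, each pair on its own node). This is a single-process sketch; the relations, key extractors, and bucket count are illustrative assumptions.

```python
from collections import defaultdict

def partition(relation, key, n_buckets):
    """Phase 1: hash each tuple's join key to one of n_buckets buckets."""
    buckets = [[] for _ in range(n_buckets)]
    for row in relation:
        buckets[hash(key(row)) % n_buckets].append(row)
    return buckets

def join_bucket(r_rows, s_rows, r_key, s_key):
    """Phase 2: classic in-memory hash join of one matching bucket pair."""
    table = defaultdict(list)
    for r in r_rows:
        table[r_key(r)].append(r)
    return [r + s for s in s_rows for r in table.get(s_key(s), [])]

def grace_hash_join(r, s, r_key, s_key, n_buckets=4):
    rb, sb = partition(r, r_key, n_buckets), partition(s, s_key, n_buckets)
    out = []
    for i in range(n_buckets):  # in a parallel DBMS, each i runs on its own node
        out.extend(join_bucket(rb[i], sb[i], r_key, s_key))
    return out

emp = [(1, "alice"), (2, "bob")]
dept = [(1, "db"), (1, "os"), (3, "ai")]
print(grace_hash_join(emp, dept, lambda r: r[0], lambda s: s[0]))
```

Because matching keys always hash to the same bucket number, bucket i of R can only join with bucket i of S, which is what makes the bucket pairs independent and parallelizable.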

Aggregation
- Compute partial results for each partition
- Hash on the group-by attribute

Updates
- Standard techniques
- An update to the partitioning attribute may move the tuple to another node

Concurrency control
- 2PL; granularity: file and page; modes: S, X, IS, IX, SIX
- Local lock manager and deadlock detector (wait-for graph)
- Centralized multi-site deadlock detector

Logging and recovery
- Log sequence number: LSN = f(node number, local sequence number)
- Processor i directs its log records to log manager (i mod M), where M = number of log managers
- Standard WAL protocol; the local log manager reduces time spent waiting for the log manager
- ARIES

Node failure -- declustering
- Availability in spite of processor or disk failure
- Mirrored disks (Tandem)
- Interleaved declustering (Teradata)
- Chained declustering: load redirection results in a 1/n load increase
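The two-phase aggregation scheme above can be sketched directly: each node aggregates its own partition, then the partial results are merged (in Gamma, a split table would hash each group to one combining node). The sample data and helper names are illustrative.

```python
from collections import defaultdict

def partial_aggregate(partition_rows, group_key, value):
    """Per-node phase: aggregate only this node's partition of the relation."""
    acc = defaultdict(int)
    for row in partition_rows:
        acc[group_key(row)] += value(row)
    return acc

def merge_partials(partials):
    """Combine phase: merge per-node partial results into final group totals.
    In a parallel DBMS each group would be routed (by hash) to one combiner."""
    total = defaultdict(int)
    for p in partials:
        for k, v in p.items():
            total[k] += v
    return dict(total)

# Sales rows (dept, amount) spread over two nodes -- illustrative data.
node1 = [("db", 10), ("os", 5)]
node2 = [("db", 7), ("ai", 3)]
partials = [partial_aggregate(p, lambda r: r[0], lambda r: r[1])
            for p in (node1, node2)]
print(merge_partials(partials))  # {'db': 17, 'os': 5, 'ai': 3}
```

This works as-is for decomposable aggregates such as SUM, COUNT, MIN, and MAX; AVG needs the partial phase to carry (sum, count) pairs instead of a single number.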

Node failure -- load redirection
(slide figure: load redirection after a node failure)

Performance experiments
- Selection: relation size, speedup, scaleup
- Join (similar), aggregate, update (no recovery)

Bonus: NCR / Teradata

Some research problems
- Parallel query optimization
- Concurrency
- Physical database design
- Scheduling (load balancing, priority)
- Application program parallelism
- On-line data reorganization