Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15


Lecture X: Parallel Databases

Topics
- Motivation and Goals
- Architectures
- Data placement
- Query processing
- Load balancing

Motivation
Large volumes of data => use disk and large main memory
=> I/O bottleneck (or memory-access bottleneck): speed(disk) << speed(RAM) << speed(microprocessor)
Predictions:
- (Micro-)processor speed growth: 50% per year (Moore's Law)
- DRAM capacity growth: 4x every three years
- Disk throughput: grew by less than 2x in the last ten years
Conclusion: the I/O bottleneck keeps worsening => increase the I/O bandwidth through parallelism

Motivation (cont.)
Moore's Law also no longer quite applies because of the heat problem. The recent trend: instead of packing more onto a single chip, increase the number of processors. => The need for parallel processing
N.B. Difference to distributed DBMSs: the nodes are not necessarily independent, and not necessarily connected via a network.

Goals
- High performance
  - Increase the I/O bandwidth through parallelism: exploit multiple processors and multiple disks
  - Intra-query parallelism (for response time)
  - Inter-query parallelism (for throughput = # of transactions/second)
  - Watch out for overhead and load balancing
- High availability
  - Exploit the existing redundancy
  - Be careful about imbalance
- Extensibility
  - Speed-up and scalability
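The speed-up and scale-up metrics behind the extensibility goal are simple elapsed-time ratios. A minimal sketch, with purely hypothetical timings chosen for illustration:

```python
# Speed-up: same problem, more nodes. Scale-up: problem grows with the system.
# All timings below are made-up numbers, for illustration only.

def speedup(t_small: float, t_big: float) -> float:
    """Elapsed time on the small system / elapsed time on the big system.
    Ideal value equals the factor by which the system grew."""
    return t_small / t_big

def scaleup(t_small: float, t_big: float) -> float:
    """Small-problem time on the small system / proportionally bigger
    problem's time on the bigger system. Ideal value is 1.0."""
    return t_small / t_big

# 1 node takes 100 s; 8 nodes take 14 s on the same query.
print(speedup(100.0, 14.0))   # ~7.14, close to the ideal 8x

# 1 node on 1 GB takes 100 s; 8 nodes on 8 GB take 110 s.
print(scaleup(100.0, 110.0))  # ~0.91, slightly sub-linear
```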

Extensibility (figure)

Today's Topics: Parallel Databases - Motivation and Goals, Architectures, Data placement, Query processing, Load balancing

Parallel System Architectures
- Shared-Memory
- Shared-Disk
- Shared-Nothing
- Hybrid: NUMA, Cluster

Shared-Memory
- Fast interconnect, single OS
- Advantages: simplicity, easy load balancing
- Problems: high cost (the interconnect), limited extensibility (~10 processors), low availability

Shared-Disk
- Separate OS per processor-memory (P-M) node
- Advantages: lower cost, higher extensibility (~100 P-M nodes), load balancing, availability
- Problems: complexity (cache consistency with lock-based protocols, 2PC, etc.), overhead, disk bottleneck

Shared-Nothing
- Separate OS per processor-memory-disk (P-M-D) node; each node ~ a site
- Advantages: extensibility and scalability, lower cost, high availability
- Problems: complexity, difficult load balancing

Hybrid Architectures: Non-Uniform Memory Architecture (NUMA)
- Cache-coherent NUMA: any processor can access any memory module
- More efficient cache consistency supported by the interconnect hardware
- Memory access cost: remote = 2-3x local

Hybrid Architectures: Cluster
- Independent, homogeneous server nodes at a single site
- Interconnect options: LAN (cheap, slower); Myrinet, InfiniBand, etc. (faster, low-latency)
- Shared-disk alternatives: NAS (Network-Attached Storage) -> low throughput; SAN (Storage-Area Network) -> high cost of ownership
- Advantages of the cluster architecture: as flexible and efficient as shared-memory; as extensible and available as shared-disk/shared-nothing

The Google Cluster
- ~15,000 nodes of homogeneous commodity PCs [BDH 03]
- Currently: over 5,000,000 servers world-wide

Parallel Architectures Summary
For a small number of nodes:
- Shared-memory -> load balancing
- Shared-disk/shared-nothing -> extensibility
- SAN w/ shared-disk -> simple administration
For a large number of nodes:
- NUMA (~100 nodes), Cluster (~1000 nodes)
- Efficiency + simplicity of shared-memory; extensibility + cost of shared-disk/shared-nothing

Topics: Parallel Databases - Motivation and Goals, Architectures, Data placement, Query processing, Load balancing

Parallel Data Placement
- Assume shared-nothing (the most general and common case).
- To reduce communication costs, programs should execute where the data reside.
- Similar to distributed DBMSs: fragmentation
- Differences: users are not associated with particular nodes; load balancing across a large number of nodes is harder.
- How to place the data so that system performance is maximized? Partitioning (minimizes response time) vs. clustering (minimizes total time)

Data Partitioning
Each relation is divided into n partitions that are mapped onto different disks.
Implementations:
- Round-robin: maps the i-th tuple to node i mod n. Simple, but does not support exact-match queries directly.
- Range: B-tree index; supports range queries, but the index is large.
- Hashing: hash function; supports only exact-match queries, but the index is small.
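The three placement rules above can be sketched in a few lines of Python. This is a toy illustration, not any real system's implementation; the node count and the range split points are assumptions:

```python
import hashlib

N_NODES = 4                             # assumed cluster size

def round_robin(i: int) -> int:
    """i-th tuple (by insertion order) goes to node i mod n."""
    return i % N_NODES

# Range partitioning: node k holds keys below RANGE_BOUNDS[k];
# the last node takes everything above the final bound.
RANGE_BOUNDS = [250, 500, 750]          # assumed split points

def range_partition(key: int) -> int:
    for node, upper in enumerate(RANGE_BOUNDS):
        if key < upper:
            return node
    return len(RANGE_BOUNDS)

def hash_partition(key: int) -> int:
    """Deterministic hash on the partitioning attribute."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % N_NODES

print(round_robin(10))        # node 2 (10 mod 4)
print(range_partition(600))   # node 2 (500 <= 600 < 750)
print(hash_partition(600))    # some node in 0..3, stable across runs
```

Note the trade-off the slide states: only `range_partition` can route a range predicate like `key BETWEEN 300 AND 600` to a subset of nodes; `round_robin` and `hash_partition` must ask every node.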

Full Partitioning Schemes (figure)

Variable Partitioning
- Each relation is partitioned across a certain number of nodes (instead of all), depending on its size and access frequency.
- Periodic reorganization for load balancing
- A global index, replicated on each node, provides associative access; plus local indices.

Global and Local Indices: Example (figure)

Replicated Data Partitioning for H/A
- High availability requires data replication.
- A simple solution is mirrored disks, but this hurts load balancing when one node fails.
- More elaborate solutions achieve load balancing: interleaved partitioning (Teradata), chained partitioning (Gamma)

Replicated Data Partitioning for H/A: Interleaved Partitioning (figure)

Replicated Data Partitioning for H/A: Chained Partitioning (figure)
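The chained placement rule is simple enough to sketch: the primary copy of fragment i lives on node i and its backup on the next node along the chain. The node count and fragment layout below are illustrative assumptions, not Gamma's actual configuration:

```python
# Chained partitioning (chained declustering) placement sketch:
# primary copy of fragment i on node i, backup on node (i + 1) mod n.
N_NODES = 4   # assumed cluster size

def chained_placement(fragment: int) -> tuple[int, int]:
    """Return (primary_node, backup_node) for a fragment."""
    primary = fragment % N_NODES
    backup = (primary + 1) % N_NODES
    return primary, backup

for frag in range(N_NODES):
    p, b = chained_placement(frag)
    print(f"fragment {frag}: primary on node {p}, backup on node {b}")

# If node 1 fails, its primary fragment is served from the backup on
# node 2, and read load can be shifted along the chain so that no single
# surviving node absorbs the whole extra load.
```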

Topics: Parallel Databases - Motivation and Goals, Architectures, Data placement, Query processing, Load balancing

Parallel Query Processing
Query parallelism:
- inter-query
- intra-query: inter-operator and intra-operator

Inter-operator Parallelism
- Pipeline parallelism: Join and Select execute in parallel.
- Independent parallelism: the two Selects execute in parallel.
Example (figure)
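The pipeline case can be illustrated with Python generators: the join consumes the select's output tuple by tuple, so both operators are logically active at once. This is a sequential sketch of the dataflow, not real multi-node execution, and the table contents are made up:

```python
def scan_select(table, predicate):
    """Select operator: yields qualifying tuples one at a time."""
    for row in table:
        if predicate(row):
            yield row

def nested_loop_join(left, right_table, key_l, key_r):
    """Join operator: pulls tuples from its producer as they arrive."""
    for l in left:                      # consumes the pipeline lazily
        for r in right_table:
            if l[key_l] == r[key_r]:
                yield {**l, **r}

emp = [{"eid": 1, "dept": "db"}, {"eid": 2, "dept": "os"}]
dept = [{"dept": "db", "floor": 3}]

# Select feeds Join in a producer/consumer pipeline: no intermediate
# result is materialized between the two operators.
selected = scan_select(emp, lambda r: r["eid"] < 2)
for out in nested_loop_join(selected, dept, "dept", "dept"):
    print(out)   # {'eid': 1, 'dept': 'db', 'floor': 3}
```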

Intra-operator Parallelism: Example (figure)

Parallel Join Processing
Three basic algorithms for intra-operator parallelism:
- Parallel Nested Loop Join: no special assumptions
- Parallel Associative Join: assumes one relation is declustered on the join attribute, and an equi-join
- Parallel Hash Join: assumes an equi-join
With minor adaptations, they also apply to other complex operators such as duplicate elimination, union, intersection, etc.

Parallel Nested Loop Join
R ⋈ S = ⋃_{i=1..n} (R ⋈ S_i)
(S is partitioned into S_1, ..., S_n; R is shipped to each of the n nodes, which join R with their local S_i.)

Parallel Associative Join
R ⋈ S = ⋃_{i=1..n} (R_i ⋈ S_i)
(S is already declustered on the join attribute; R is partitioned the same way, and matching partitions are joined locally.)

Parallel Hash Join
R ⋈ S = ⋃_{i=1..p} (R_i ⋈ S_i)
(Both relations are hash-partitioned on the join attribute into p partition pairs, each joined independently.)
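A minimal sketch of the partitioned hash join, assuming an equi-join: both relations are hash-partitioned on the join attribute, then each partition pair is joined with a local build/probe hash join. The "nodes" here are just buckets processed in a loop; a real system would run the p local joins on separate nodes. Relations and attribute names are made up:

```python
from collections import defaultdict

P = 3   # assumed number of partitions / nodes

def partition(relation, key):
    """Hash-partition a relation on the join attribute."""
    parts = defaultdict(list)
    for row in relation:
        parts[hash(row[key]) % P].append(row)
    return parts

def local_hash_join(r_part, s_part, key_r, key_s):
    """Classic build/probe hash join on one partition pair."""
    table = defaultdict(list)
    for r in r_part:                    # build phase
        table[r[key_r]].append(r)
    out = []
    for s in s_part:                    # probe phase
        for r in table.get(s[key_s], []):
            out.append({**r, **s})
    return out

R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"a": 1, "y": "s1"}, {"a": 3, "y": "s2"}]

r_parts, s_parts = partition(R, "a"), partition(S, "a")
result = []
for i in range(P):                      # independent local joins
    result += local_hash_join(r_parts[i], s_parts[i], "a", "a")
print(result)   # [{'a': 1, 'x': 'r1', 'y': 's1'}]
```

Matching tuples always hash to the same partition, which is exactly why the formula decomposes into a union of independent local joins.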

Which one to use?
- Use Parallel Associative Join where applicable (i.e., an equi-join with partitioning on the join attribute).
- Otherwise, compute the total communication + processing cost for Parallel Nested Loop Join and Parallel Hash Join, and use the one with the smaller cost.

Topics: Parallel Databases - Motivation and Goals, Architectures, Data placement, Query processing, Load balancing

Three Barriers to Extensibility: Ideal Curves (figure)

Load Balancing
Skewed data distributions make load balancing in intra-operator parallelism harder:
- Attribute Value Skew (AVS)
- Tuple Placement Skew (TPS)
- Selectivity Skew (SS)
- Redistribution Skew (RS)
- Join Product Skew (JPS)

Data Skew Example (figure)

Load Balancing Techniques
- Intra-operator load balancing:
  - Adaptive techniques (adapt to skew by dynamic load reallocation)
  - Specialized techniques (switch between specialized parallel join algorithms that can deal with skew)
- Inter-operator load balancing (increase pipeline parallelism)
- Intra-query load balancing (combines the two)