Database System Architectures: Parallel DBs, MapReduce, Column Stores

CMPSCI 445, Fall 2010
Some slides courtesy of Yanlei Diao, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet

Motivation: Large Scale Data Processing
- Want to process lots of data (> 1 TB)
- Want to parallelize across hundreds/thousands of commodity computers
  - New definition of cluster computing: large numbers of low-end processors working in parallel to solve a computing problem.
  - Parallel DB: shared-nothing architecture, i.e., many interconnected machines with independent CPUs and disks
- Want to make this easy

Parallel Database Systems
- We've studied centralized and client-server systems so far.
- For massive datasets or extremely large numbers of transactions, parallel systems are often required.
- Parallel database system: multiple CPUs and disks used in parallel

Speed-up
- Running a given task in less time by increasing the degree of parallelism is called speedup.
[Figure: speed plotted against resources, with a linear speedup curve and a sublinear speedup curve]

Scale-up
- Handling larger tasks by increasing the degree of parallelism is called scaleup.
[Figure: linear and sublinear scaleup curves; axis labels: "Time Improvement" and "Increasing Problem Size & Resources"]
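The two definitions are commonly written as ratios of elapsed times. The sketch below assumes that textbook formulation; the function names and the timings are hypothetical and are not taken from the slides.

```python
def speedup(time_on_small_system: float, time_on_large_system: float) -> float:
    """Same task, more resources: ratio of the old elapsed time to the new one."""
    return time_on_small_system / time_on_large_system


def scaleup(time_small_problem_small_system: float,
            time_large_problem_large_system: float) -> float:
    """Problem size and resources grow by the same factor.

    A value of 1.0 means the larger problem finished in the same time
    (linear scaleup); a value below 1.0 is sublinear scaleup.
    """
    return time_small_problem_small_system / time_large_problem_large_system


# Hypothetical measurements, in seconds, for a system grown by a factor of 4.
resources_factor = 4
measured_speedup = speedup(100.0, 25.0)    # same task on 4x the resources
measured_scaleup = scaleup(100.0, 125.0)   # 4x the task on 4x the resources

print("linear speedup:", measured_speedup == resources_factor)
print("scaleup:", measured_scaleup)
```

With these made-up numbers the speedup is linear (4 on 4x resources), while the scaleup is sublinear because the larger problem took longer than the original one.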

Obstacles in parallel systems
- Start-up costs: Starting each process has a cost. If the start-up time dominates the actual processing time, speedup is adversely affected.
- Interference: Processes executing in parallel may need to access shared resources (system bus, shared disks, locks). Scaleup and speedup may be hurt.
- Skew: When a task is divided into many subtasks that execute in parallel, the task completes only when the last subtask completes. If subtasks are skewed, we are limited by the longest subtask.
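The effect of start-up cost and skew can be illustrated with a toy cost model. This is only a sketch: it assumes one subtask per worker and a fixed start-up cost paid once, and all numbers are invented for illustration.

```python
def parallel_elapsed(startup_cost: float, subtask_times: list) -> float:
    """Toy model: pay the start-up cost, then wait for the slowest subtask."""
    return startup_cost + max(subtask_times)


serial_time = 100.0                      # hypothetical single-machine time
balanced = [25.0, 25.0, 25.0, 25.0]      # work split evenly over 4 workers
skewed = [10.0, 10.0, 10.0, 70.0]        # same total work, one heavy subtask

for label, subtasks in (("balanced", balanced), ("skewed", skewed)):
    for startup in (0.0, 5.0):
        elapsed = parallel_elapsed(startup, subtasks)
        print(label, "startup", startup, "speedup", round(serial_time / elapsed, 2))
```

Both splits do the same total work, but the skewed split finishes only when its 70-second subtask does, so its speedup stays well below 4.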

Parallel databases
- Research over the last 20 years
- Major issues:
  - Data partitioning
  - Managing skew
  - Intraquery and interquery parallelism
  - Query optimization can become very difficult
- A number of mature commercial systems: Teradata, Netezza, Greenplum, ...
- Not many open-source systems

MapReduce
- Automatic parallelization & distribution
- Fault-tolerance
- Status and monitoring tools
- Clean abstraction for programmers

Programming Model
- Borrows from functional programming
- Users implement an interface of two functions:
    map(in_key, in_value) -> (out_key, intermediate_value) list
    reduce(out_key, intermediate_value list) -> out_value list

map
- Input key-value pairs: records from the data source, e.g., lines out of files (filename, line), rows of a database, etc.
- map() is applied independently to each line or record.
- map() produces one or more intermediate values along with an output key from the input.

Example:
  Input fields: (line, long, lat, month, day, year, temp)
    (0057, +51317, +028783, 01, 01, 1950, 39)
    (0058, +51321, +018253, 02, 13, 1951, 32)
  map emits (year, temp):
    (1950, 39)
    (1951, 32)
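The temperature example can be written as a small map function. This is a sketch rather than any framework's API: the name `map_year_temp` is hypothetical, and it assumes each record arrives as a tuple of strings in the field order shown on the slide.

```python
from typing import Iterator, Tuple


def map_year_temp(record: Tuple[str, ...]) -> Iterator[Tuple[int, int]]:
    """Emit one (year, temp) pair per weather record."""
    # Field order follows the slide: (line, long, lat, month, day, year, temp).
    line, lon, lat, month, day, year, temp = record
    yield int(year), int(temp)


records = [
    ("0057", "+51317", "+028783", "01", "01", "1950", "39"),
    ("0058", "+51321", "+018253", "02", "13", "1951", "32"),
]

# map() is applied to each record independently of the others.
intermediate = [pair for record in records for pair in map_year_temp(record)]
print(intermediate)
```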

reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list.
- reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

Example:
  Input to reduce:
    (1950, (39, 20, 18, 32))
    (1951, (32, 8, 23, 12, 10, 12))
    ...
    (2010, (31, 29, 27, 27, 18, 12))
  reduce (max) emits:
    (1950, 39)
    (1951, 32)
    ...
    (2010, 31)
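A matching reduce function for the maximum-temperature example might look like the sketch below. The name `reduce_max` is hypothetical, and the grouped input holds only the three years that appear on the slide.

```python
from typing import Iterable, Iterator, Tuple


def reduce_max(year: int, temps: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Collapse all temperatures for one year into a single maximum."""
    yield year, max(temps)


# Intermediate values already grouped by output key (the year).
grouped = {
    1950: [39, 20, 18, 32],
    1951: [32, 8, 23, 12, 10, 12],
    2010: [31, 29, 27, 27, 18, 12],
}

final = [result for year, temps in grouped.items()
         for result in reduce_max(year, temps)]
print(final)
```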

Example: Count Word Occurrences

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

How do we implement this using a relational DBMS?
- Customized data loading (data may be used only once), then Group By.
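One way to picture the relational answer is with an in-memory SQLite database from Python's standard library. This is a sketch under assumptions: the `words` table, its columns, and the sample documents are hypothetical, and the tokenizing step stands in for the "customized data loading".

```python
import sqlite3

# Hypothetical input documents (name -> contents).
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog the end",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (doc TEXT, word TEXT)")

# Customized data loading: split each document into one row per word.
rows = [(name, word) for name, text in documents.items() for word in text.split()]
conn.executemany("INSERT INTO words VALUES (?, ?)", rows)

# The counting itself is an ordinary GROUP BY.
counts = dict(conn.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
conn.close()

print(counts["the"])
```

The loading step plays the role of map() and the GROUP BY with COUNT(*) plays the role of reduce().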

Extract (key, value) using map(); group by key; apply reduce().
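These three steps can be sketched as a tiny single-process driver. It is an illustration only, not how a real MapReduce runtime is implemented: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and everything runs sequentially in memory.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Sequential sketch of the pipeline: map, group by key, reduce."""
    # Step 1: extract (key, value) pairs using map().
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            # Step 2: group by key.
            groups[out_key].append(value)
    # Step 3: apply reduce() once per key (sorted only for stable output).
    results = []
    for out_key in sorted(groups):
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results


def wc_map(doc_name, contents):
    for word in contents.split():
        yield word, 1


def wc_reduce(word, counts):
    yield word, sum(counts)


docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)
print(word_counts)
```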

Parallelism
- map() functions run in parallel, creating different intermediate values from different input data sets.
- reduce() functions also run in parallel, each working on a different output key.
- All values are processed independently.
- Bottleneck: the reduce phase can't start until the map phase is completely finished.
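The barrier between the two phases can be sketched with a thread pool from Python's standard library. This is a toy model under assumptions: threads stand in for worker machines, and `map_task`, `reduce_task`, and the input splits are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Word-count map over one input split."""
    return [(word, 1) for word in split.split()]


def reduce_task(item):
    """Sum the values collected for one output key."""
    key, values = item
    return key, sum(values)


splits = ["a b a", "b c", "a c c"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map tasks run concurrently, one per split. list() blocks until every
    # map task has finished: this is the barrier before the reduce phase.
    map_outputs = list(pool.map(map_task, splits))

    # Group intermediate values by output key.
    groups = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            groups[key].append(value)

    # Reduce tasks also run concurrently, each on a different key.
    reduced = dict(pool.map(reduce_task, groups.items()))

print(reduced)
```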

Fault Tolerance
- Master detects worker failures
  - Re-executes completed & in-progress map() tasks
  - Re-executes in-progress reduce() tasks
- Master notices particular input key/values cause crashes in map(), and skips those values on re-execution.
  - Effect: Can work around bugs in third-party libraries!
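The skip-on-re-execution behavior can be sketched as a retry loop. This is a simplified illustration, not the actual master protocol: `TaskCrash`, `run_map_task`, `master_run`, and `parse_temp` are hypothetical names, and a record is skipped here after a single crash.

```python
class TaskCrash(Exception):
    """Raised by a map task, reporting which record it crashed on."""

    def __init__(self, record_index):
        super().__init__(f"map crashed on record {record_index}")
        self.record_index = record_index


def run_map_task(records, map_fn, skip):
    output = []
    for index, record in enumerate(records):
        if index in skip:
            continue  # the master told us to skip this record
        try:
            output.extend(map_fn(record))
        except Exception as exc:
            raise TaskCrash(index) from exc
    return output


def master_run(records, map_fn, max_attempts=5):
    """Re-execute the map task, skipping records that caused a crash."""
    skip = set()
    for _ in range(max_attempts):
        try:
            return run_map_task(records, map_fn, skip), skip
        except TaskCrash as crash:
            skip.add(crash.record_index)
    raise RuntimeError("map task kept failing")


def parse_temp(record):
    yield int(record)  # raises ValueError on malformed input


output, skipped = master_run(["39", "n/a", "32"], parse_temp)
print(output, skipped)
```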

Optimizations
- No reduce can start until map is complete:
  - A single slow disk controller can rate-limit the whole process
- Master redundantly executes slow-moving map tasks; uses results of first copy to finish

Optimizations
- Combiner functions can run on same machine as a mapper
- Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
- Under what conditions is it sound to use a combiner?
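One way to explore the question is to compare results with and without a combiner. The sketch below assumes the combiner simply reuses the reduce function on each mapper's local values; the helper names and data are hypothetical. Sum, which is commutative and associative, is unaffected, while a plain mean is not.

```python
def reduce_sum(values):
    return sum(values)


def reduce_mean(values):
    return sum(values) / len(values)


def without_combiner(reduce_fn, per_mapper_values):
    """Ship every intermediate value to the reducer."""
    return reduce_fn([v for values in per_mapper_values for v in values])


def with_combiner(reduce_fn, per_mapper_values):
    """Pre-reduce on each mapper, then reduce the partial results."""
    return reduce_fn([reduce_fn(values) for values in per_mapper_values])


# Values for one output key, as produced by two different mappers.
per_mapper = [[1, 2, 3], [10]]

print(without_combiner(reduce_sum, per_mapper), with_combiner(reduce_sum, per_mapper))
print(without_combiner(reduce_mean, per_mapper), with_combiner(reduce_mean, per_mapper))
```

A mean can still be combined safely if the combiner emits (sum, count) pairs and only the final reduce divides.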

Some Comments on MapReduce
- Strengths:
  - Simplicity and low cost
  - Infrastructure support: massive parallelism, fault tolerance, with proven success
  - Complex map functions for parse, transform; complex analytics for data mining, clustering, etc.
  - Storage system independent; heterogeneous storage systems
- Limitations:
  - Performance for structured data analysis. No indexes; often large, repeated scans.
  - Language not declarative
  - Data model is limited

Column Oriented Databases
- Can be used in centralized or parallel deployments
- Pages are collections of the attribute values of a single column of a table --> think extreme decomposition
- Clear performance advantages for read-mostly workloads which access a small number of columns.
- Standard databases can simulate column stores:
  - vertical partitioning and materialization
  - indexing of individual columns
- Current column oriented databases improve on this:
  - aggressive use of compression
  - late joining
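The difference between the two layouts, and why columns compress well, can be sketched with plain Python lists. This is an in-memory illustration only: the table contents are hypothetical, and run-length encoding is used here as one example of a compression scheme, not as the method of any particular system.

```python
from itertools import groupby

# Row layout: each tuple holds all attributes of one record.
column_names = ("year", "temp")
rows = [(1950, 39), (1950, 20), (1951, 32), (1951, 8)]

# Column layout: all values of one attribute are stored together.
column_store = {
    name: [row[i] for row in rows] for i, name in enumerate(column_names)
}


def run_length_encode(values):
    """Compress consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# A query over one column touches only that column's values.
max_temp = max(column_store["temp"])

# Sorted or low-cardinality columns shrink well under run-length encoding.
encoded_years = run_length_encode(column_store["year"])

print(max_temp, encoded_years)
```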

Articles to Read
- MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI 2004.
- A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. SIGMOD 2009.
- MapReduce and Parallel DBMSs: Friends or Foes? Michael Stonebraker, Daniel Abadi, David J. DeWitt, Sam Madden, Erik Paulson, Andrew Pavlo, and Alexander Rasin. CACM Jan 2010.
- MapReduce: A Flexible Data Processing Tool. Jeffrey Dean and Sanjay Ghemawat. CACM Jan 2010.