EECS 262a Advanced Topics in Computer Systems, Lecture 17: Comparison of Parallel DB, CS, MR and Jockey

EECS 262a Advanced Topics in Computer Systems, Lecture 17
Comparison of Parallel DB, CS, MR and Jockey
October 30th, 2013
John Kubiatowicz and Anthony D. Joseph
Electrical Engineering and Computer Sciences, University of California, Berkeley

Today's Papers
- A Comparison of Approaches to Large-Scale Data Analysis. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. Appears in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009.
- Jockey: Guaranteed Job Latency in Data Parallel Clusters. Andrew D. Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. Appears in Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2012.
- Thoughts?
- http://www.eecs.berkeley.edu/~kubitron/cs262

Two Approaches to Large-Scale Data Analysis
- Shared nothing
- MapReduce
  - Distributed file system
  - Map, Split, Copy, Reduce
  - MR scheduler
- Parallel DBMS
  - Standard relational tables (physical location transparent)
  - Data are partitioned over cluster nodes
  - SQL
  - Join processing: T1 joins T2 (a minimal sketch of both strategies follows the Architectural Differences table below)
    - If T2 is small, copy T2 to all the machines
    - If T2 is large, hash partition T1 and T2 and send the partitions to different machines (similar to the split-copy in MapReduce)
  - Query optimization
  - Intermediate tables not materialized by default

Architectural Differences
                       Parallel DBMS                  MapReduce
  Schema support       Yes                            No
  Indexing             Yes                            No
  Programming model    Stating what you want (SQL)    Presenting an algorithm (C/C++, Java, ...)
  Optimization         Yes                            No
  Flexibility          Spotty UDF support             Good
  Fault tolerance      Not as good                    Good
  Node scalability     <100                           >10,000
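The two join strategies mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy model, not code from either system: plain Python lists stand in for cluster nodes, and NUM_NODES, the tuple layout, and the hash partitioning are all illustrative choices.

```python
# Toy sketch of the two parallel join strategies described on the slide.
from collections import defaultdict

NUM_NODES = 4

def broadcast_join(t1_partitions, t2_small):
    """If T2 is small, ship a full copy of T2 to every node holding a T1 partition."""
    results = []
    for t1_part in t1_partitions:                  # each "node" joins locally
        lookup = {key: val for key, val in t2_small}
        results += [(key, v1, lookup[key]) for key, v1 in t1_part if key in lookup]
    return results

def repartition_join(t1_partitions, t2_partitions):
    """If T2 is large, hash-partition both tables on the join key (like MR's split/copy)."""
    buckets1, buckets2 = defaultdict(list), defaultdict(list)
    for part in t1_partitions:
        for key, val in part:
            buckets1[hash(key) % NUM_NODES].append((key, val))
    for part in t2_partitions:
        for key, val in part:
            buckets2[hash(key) % NUM_NODES].append((key, val))
    results = []
    for node in range(NUM_NODES):                  # each "node" joins its own bucket pair
        lookup = defaultdict(list)
        for key, val in buckets2[node]:
            lookup[key].append(val)
        results += [(k, v1, v2) for k, v1 in buckets1[node] for v2 in lookup[k]]
    return results

if __name__ == "__main__":
    t1 = [[(1, "a"), (2, "b")], [(3, "c")]]        # T1 split across two "nodes"
    t2 = [[(1, "X")], [(3, "Y")]]
    print(broadcast_join(t1, [(1, "X"), (3, "Y")]))
    print(repartition_join(t1, t2))
```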

Schema Support
- MapReduce
  - Flexible: programmers write code to interpret the input data
  - Good for a single-application scenario
  - Bad if data are shared by multiple applications: must address data syntax, consistency, etc.
- Parallel DBMS
  - Relational schema required
  - Good if data are shared by multiple applications

Programming Model & Flexibility
- MapReduce
  - Low level: "We argue that MR programming is somewhat analogous to Codasyl programming"
  - "Anecdotal evidence from the MR community suggests that there is widespread sharing of MR code fragments to do common tasks, such as joining data sets."
  - Very flexible
- Parallel DBMS
  - SQL
  - User-defined functions, stored procedures, user-defined aggregates

Indexing
- MapReduce
  - No native index support
  - Programmers can implement their own index support in Map/Reduce code
  - But it is hard to share such customized indexes across multiple applications
- Parallel DBMS
  - Hash/B-tree indexes well supported

Execution Strategy & Fault Tolerance
- MapReduce
  - Intermediate results are saved to local files
  - If a node fails, rerun that node's task on another node
  - At a mapper machine, when multiple reducers are reading multiple local files, there can be large numbers of disk seeks, leading to poor performance
- Parallel DBMS
  - Intermediate results are pushed across the network
  - If a node fails, the entire query must be rerun

Avoiding Data Transfers
- MapReduce
  - Schedules Map tasks close to the data
  - Other than this, programmers must avoid data transfers themselves
- Parallel DBMS
  - Many optimizations, such as determining where to perform filtering

Node Scalability
- MapReduce: 10,000s of commodity nodes; 10s of petabytes of data
- Parallel DBMS: <100 expensive nodes; petabytes of data

Performance Benchmarks
- Benchmark environment
- Original MR task (Grep)
- Analytical tasks
  - Selection
  - Aggregation
  - Join
  - User-defined-function (UDF) aggregation

Benchmark Environment: Node Configuration
- 100-node cluster
- Each node: 2.40 GHz Intel Core 2 Duo, 64-bit Red Hat Enterprise Linux 5 (kernel 2.6.18), 4 GB RAM, two 250 GB SATA HDDs
- Nodes interconnected with Cisco Catalyst 3750E 1 Gb/s switches
  - Internal switching fabric of 128 Gbps
  - 50 nodes per switch
- Multiple switches interconnected via a 64 Gbps Cisco StackWise ring
  - The ring is only used for cross-switch communication

Tested Systems
- Hadoop (0.19.0 on Java 1.6.0)
  - HDFS data block size: 256 MB
  - JVMs use a 3.5 GB heap size per node
  - Rack awareness enabled for data locality
  - Three replicas without compression: compression or fewer replicas in HDFS does not improve performance
- DBMS-X (a parallel SQL DBMS from a major vendor)
  - Row store
  - 4 GB shared memory for buffer pool and temp space per node
  - Compressed tables (compression often reduces time by 50%)
- Vertica
  - Column store
  - 256 MB buffer size per node
  - Columns compressed by default

Benchmark Execution
- Data loading time
  - Actual loading of the data
  - Additional operations after the loading, such as compressing or building indexes
- Execution time
  - DBMS-X and Vertica: final results are piped from a shell command into a file
  - Hadoop: final results are stored in HDFS; an additional Reduce job combines the multiple output files into a single file

Performance Benchmarks
- Benchmark environment
- Original MR task (Grep)
- Analytical tasks
  - Selection
  - Aggregation
  - Join
  - User-defined-function (UDF) aggregation

Grep Task Description
- From the MapReduce paper
- Input data set: 100-byte records
- Look for a three-character pattern
- One match per 10,000 records
- Two variants: vary the number of nodes with a fixed data size per node (535 MB/node), or fix the total data size (1 TB)

Data Loading
- Hadoop: copy text files into HDFS in parallel
- DBMS-X
  - LOAD SQL command executed in parallel: performs hash partitioning and distributes records to multiple machines
  - Reorganize data on each node: compress data, build indexes, perform other housekeeping
    - This happens in parallel
- Vertica
  - COPY command to load data in parallel
  - Data is re-distributed, then compressed

Data Loading Times
- DBMS-X: grey is loading, white is re-organization after loading
  - Loading is actually sequential despite the parallel load commands
- Hadoop does better because it only copies the data to three HDFS replicas

Grep Execution
- SQL: SELECT * FROM data WHERE field LIKE '%XYZ%'
  - Full table scan
- MapReduce (see the sketch after this slide group)
  - Map: pattern search
  - No Reduce
  - An additional Reduce job combines the output into a single file

Grep Execution Time
(Figures: stacked bars of grep time and combine-output time)
- Hadoop's large start-up cost shows up in Figure 4, when the data per node is small
- Vertica benefits from good data compression
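To make the map-only structure of the Grep task concrete, here is a minimal Hadoop Streaming-style sketch. It is not the paper's code: the PATTERN value and the line-oriented input format are illustrative assumptions.

```python
# Map-only Grep sketch: scan each record and emit it if it contains the pattern.
import sys

PATTERN = "XYZ"  # hypothetical three-character pattern

def map_grep(lines):
    """Map: pattern search; there is no Reduce phase for this task."""
    for line in lines:
        if PATTERN in line:
            # Emit the matching record; a separate MR job would later merge the outputs.
            sys.stdout.write(line if line.endswith("\n") else line + "\n")

if __name__ == "__main__":
    map_grep(sys.stdin)
```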

Performance Benchmarks
- Benchmark environment
- Original MR task (Grep)
- Analytical tasks
  - Selection
  - Aggregation
  - Join
  - User-defined-function (UDF) aggregation

Input Data
- Input #1: random HTML documents
  - Inside an HTML document, links are generated with a Zipfian distribution
  - 600,000 unique HTML documents with unique URLs per node
- Input #2: 155 million UserVisits records (20 GB/node)
- Input #3: 18 million Rankings records (1 GB/node)

Selection Task
- Find the pageURLs in the Rankings table (1 GB/node) with a pageRank above a threshold
  - 36,000 records per data file (very selective)
- SQL: SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
- MR: a single Map, no Reduce (see the sketch after this slide group)

Selection Task Results
- Hadoop's start-up cost dominates
- The DBMSs use an index
- Vertica's reliable message layer becomes the bottleneck
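A minimal Hadoop Streaming-style sketch of the Selection task, under stated assumptions: the '|' delimiter, the (pageURL, pageRank) column order, and the THRESHOLD value are placeholders, not the benchmark's actual data layout.

```python
# Map-only Selection sketch over Rankings records.
import sys

THRESHOLD = 10  # the "X" in the SQL query; value here is illustrative

def map_selection(lines):
    for line in lines:
        fields = line.rstrip("\n").split("|")      # assumed field delimiter
        page_url, page_rank = fields[0], int(fields[1])   # assumed column order
        if page_rank > THRESHOLD:
            print(f"{page_url}\t{page_rank}")      # emit qualifying (pageURL, pageRank)

if __name__ == "__main__":
    map_selection(sys.stdin)
```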

Aggregation Task
- Calculate the total adRevenue generated for each sourceIP in the UserVisits table (20 GB/node), grouped by the sourceIP column
  - Nodes must exchange information to compute the group-by
  - Generates 53 MB of output regardless of the number of nodes
- SQL: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
- MR (see the sketch after the Join task below)
  - Map: outputs (sourceIP, adRevenue)
  - Reduce: compute the sum per sourceIP
  - A Combiner is used

Aggregation Task Results
- DBMS: local group-by, then the coordinator performs the global group-by; performance is dominated by data transfer

Join Task
- Find the sourceIP that generated the most revenue within Jan 15-22, 2000, then calculate the average pageRank of all the pages visited by that sourceIP during this interval
- SQL:
  SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue
  FROM Rankings AS R, UserVisits AS UV
  WHERE R.pageURL = UV.destURL
    AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')
  GROUP BY UV.sourceIP;

  SELECT sourceIP, totalRevenue, avgPageRank
  FROM Temp
  ORDER BY totalRevenue DESC LIMIT 1;
- MapReduce
  - Phase 1: filter out UserVisits records outside the desired date range, then join the qualifying records with records from the Rankings file
  - Phase 2: compute the total adRevenue and average pageRank per sourceIP
  - Phase 3: produce the single largest record
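A minimal Hadoop Streaming-style sketch of the Aggregation task described above. This is an assumption, not the benchmark's code: the '|' delimiter and the position of adRevenue in the record are illustrative, and the reducer doubles as the Combiner the slide mentions.

```python
# Aggregation sketch: sum adRevenue per sourceIP.
import sys
from itertools import groupby

def map_uservisits(lines):
    """Map: emit (sourceIP, adRevenue) from each UserVisits record."""
    for line in lines:
        fields = line.rstrip("\n").split("|")      # assumed field delimiter
        source_ip, ad_revenue = fields[0], fields[3]   # assumed column positions
        print(f"{source_ip}\t{ad_revenue}")

def reduce_sum(lines):
    """Reduce (and Combiner): sum the revenue per sourceIP.
    Assumes input is sorted by key, as Hadoop Streaming provides."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for source_ip, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(float(revenue) for _, revenue in group)
        print(f"{source_ip}\t{total:.2f}")

if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (map_uservisits if phase == "map" else reduce_sum)(sys.stdin)
```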

Join Task Results
- Why was MR always forced to use Reduce to combine results?
- The DBMSs can use an index, and both relations are partitioned on the join key; MR has to read all of the data
- MR phase 1 takes 1434.7 seconds on average
  - 600 seconds of raw I/O to read the table; 300 seconds to split, parse, and deserialize
  - Thus CPU overhead is the limiting factor

UDF Aggregation Task
- Compute the inlink count per document
- SQL:
  SELECT INTO Temp F(contents) FROM Documents;
  SELECT url, SUM(value) FROM Temp GROUP BY url;
  - Needs a user-defined function F to parse HTML documents (a C program using the POSIX regex library)
  - Neither DBMS supports UDFs very well, requiring a separate program that uses local disk and bulk loading of the DBMS
- MR: a standard MR program (see the sketch after this slide group)

UDF Aggregation Results
- DBMS: lower bar is UDF time, upper bar is the rest of the query time
- Hadoop: lower bar is query time, upper bar is combining all results into one file

Discussion
- Throughput experiments?
- Parallel DBMSs are much more challenging than Hadoop to install and configure properly
  - DBMSs require professional DBAs to configure/tune
- Alternatives: Shark (Hive on Spark)
  - Eliminates Hadoop task start-up cost and answers queries with sub-second latencies
    - 100-node system: 10 seconds until the first task starts, 25 seconds until all nodes run tasks
  - Columnar memory store (multiple orders of magnitude faster than disk)
- Compression: does not help in Hadoop? An artifact of Hadoop's Java-based implementation?
- Execution strategy (DBMS), failure model (Hadoop), ease of use (H/D)
- Other alternatives? Apache Hive, Impala (Cloudera), ...
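The benchmark's UDF is a C program using the POSIX regex library; here is a minimal Hadoop Streaming-style stand-in for the same task, under stated assumptions: the href regex, line-oriented documents, and the phase flag are all illustrative simplifications.

```python
# UDF aggregation sketch: count inlinks per URL by scanning documents for hrefs.
import re
import sys
from itertools import groupby

HREF_RE = re.compile(r'href="([^"]+)"')   # simplified stand-in for the regex-based parser

def map_documents(lines):
    """Map: play the role of the UDF F(contents) -- emit (url, 1) for every link found."""
    for line in lines:
        for url in HREF_RE.findall(line):
            print(f"{url}\t1")

def reduce_inlinks(lines):
    """Reduce: SUM(value) GROUP BY url; assumes key-sorted input as in Hadoop Streaming."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for url, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{url}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    phase = sys.argv[1] if len(sys.argv) > 1 else "map"
    (map_documents if phase == "map" else reduce_inlinks)(sys.stdin)
```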

Alternative: HadoopDB?
(An Architectural Hybrid of MapReduce & DBMS)
- The basic idea: use MR as the communication layer above multiple nodes running single-node DBMS instances
- Queries are expressed in SQL and translated into MR by extending existing tools
- As much work as possible is pushed into the higher-performing single-node databases
- How many of the complaints from the Comparison paper still apply here?
- Hadapt: a startup commercializing this approach

Is This a Good Paper?
- What were the authors' goals?
- What about the evaluation/metrics?
- Did they convince you that this was a good system/approach?
- Were there any red flags?
- What mistakes did they make?
- Does the system/approach meet the Test of Time challenge?
- How would you review this paper today?

BREAK

Domain for Jockey: Large Cluster Jobs
- Predictability is very important
- Enforcement of deadlines is one way toward predictability

Variable Execution Latency: Prevalent
- Even for the job with the narrowest latency profile, there is over 4.3x variation in latency
- Reasons for latency variation:
  - Pipeline complexity
  - Noisy execution environment
  - Excess resources

Job Model: Graph of Interconnected Stages
(Figure: a job is a graph of stages, each stage made up of tasks)

Dryad's DAG Workflow
- Many simultaneous job pipelines executing at once
- Some on behalf of Microsoft, others on behalf of customers

Compound Workflow
- Dependencies mean that deadlines on the complete pipeline create deadlines on the constituent jobs
- The median job's output is used by 10 additional jobs

Best Way to Express Performance Targets
- Priorities? Not expressive enough
- Weights? Difficult for users to set
- Utility curves? Capture deadline & penalty
- Jockey's goal: maximize utility while minimizing resources by dynamically adjusting the allocation

Application Modeling
- C(progress, allocation) -> remaining run time
- Techniques:
  - Job simulator: input from profiling feeds a simulator that explores possible scenarios
  - Amdahl's Law: Time = S + P/N; estimate S and P from the standpoint of the current stage
- Progress metric? Many were explored
  - totalWorkWithQ: total time completed tasks spent enqueued or executing
- Optimization: minimum allocation that maximizes utility
- Control loop design: slack (1.2), hysteresis, dead zone (D); a sketch of the allocation rule follows the example table below

Ex: Completion (1%), deadline (50 min); Ex: Completion (3%), deadline (50 min)
The model C(progress, allocation) is a table of predicted remaining run times that feeds the Jockey control loop:

                   10 nodes      20 nodes      30 nodes
  1% complete      60 minutes    40 minutes    25 minutes
  2% complete      59 minutes    39 minutes    24 minutes
  3% complete      58 minutes    37 minutes    22 minutes
  4% complete      56 minutes    36 minutes    21 minutes
  5% complete      54 minutes    34 minutes    20 minutes
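A minimal sketch of the allocation rule implied by the slides, not Jockey's actual implementation: the C table and the slack factor 1.2 come from the slides, while the selection rule (scale the predicted remaining time by the slack and pick the smallest allocation that still meets the deadline) is a simplified assumption; hysteresis and the dead zone are omitted.

```python
# Sketch of a Jockey-style control-loop step: pick the minimum allocation
# whose predicted completion time, with slack, still meets the deadline.
SLACK = 1.2   # over-provisioning factor from the slide

# C(progress, allocation): rows keyed by percent complete, columns by node count.
C = {
    1: {10: 60, 20: 40, 30: 25},
    2: {10: 59, 20: 39, 30: 24},
    3: {10: 58, 20: 37, 30: 22},
    4: {10: 56, 20: 36, 30: 21},
    5: {10: 54, 20: 34, 30: 20},
}

def pick_allocation(percent_complete, minutes_to_deadline):
    """Return the smallest node count predicted to finish before the deadline."""
    row = C[percent_complete]
    for nodes in sorted(row):
        if row[nodes] * SLACK <= minutes_to_deadline:
            return nodes
    return max(row)   # deadline cannot be met; fall back to the maximum allocation

if __name__ == "__main__":
    # What this sketch computes for the three example scenarios on the slides:
    print(pick_allocation(1, 50))   # -> 20 nodes
    print(pick_allocation(3, 50))   # -> 20 nodes
    print(pick_allocation(5, 30))   # -> 30 nodes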

Ex: Completion (5%), deadline (30 min)
(Same C(progress, allocation) table as above, feeding the Jockey control loop)

Jockey in Action
(Figures: a running job's resource allocation over time under the Jockey control loop)
- Initial deadline: 140 minutes
- New deadline: 70 minutes
- Release resources due to excess pessimism

Jockey in Action (continued)
- Available parallelism less than allocation
- "Oracle" allocation: total allocation-hours
- Allocation above oracle

Evaluation
(Figure: CDF of job completion time relative to the deadline for Jockey; annotations: 1.4x, deadline)
- Jobs which met the SLO

Evaluation
(Figure: CDF of job completion time relative to the deadline for max allocation, allocation from simulator, control loop only, and Jockey)
- Allocation from simulator: made good predictions, 80% finish before the deadline, but allocated too many resources
- Jockey: missed 1 of 94 deadlines; the control loop is stable and successful

Evaluation
(Figure: fraction of deadlines missed vs. fraction of allocation above the oracle, for allocation from simulator, control loop only, Jockey, and max allocation)

Is This a Good Paper?
- What were the authors' goals?
- What about the evaluation/metrics?
- Did they convince you that this was a good system/approach?
- Were there any red flags?
- What mistakes did they make?
- Does the system/approach meet the Test of Time challenge?
- How would you review this paper today?