GLADE: A Scalable Framework for Efficient Analytics
Florin Rusu, University of California, Merced

Motivation and Objective
- Large-scale data processing
  - Map-Reduce is the standard technique, targeted at distributed index computation
  - Hadoop is the standard system: scalable, but inefficient
- Design and implement a system for aggregate computation
  - Large-scale data
  - Exact and approximate aggregates
  - Complex aggregates
  - Architecture-independent
  - Efficient and scalable

Outline
- DataPath Query Execution Engine
  - Storage Manager
  - Waypoints & Work-units
- Generalized Linear Aggregates (GLA)
- GLADE
  - Shared-Memory
  - Shared-Nothing
- GLA vs. Map-Reduce

DataPath [Arumugam et al. 2010]
- Multi-query analytical execution engine with shared scan & processing
- Push-driven execution: data is pushed through the operators at the speed it is read from disk
- Chunk-at-a-time processing [MapReduce 2004]
- Sequential scan (large pages)
- Vectorized execution (loop unrolling) [MonetDB/X100 2007]
- Runtime code generation, compilation, loading, and execution: M4 templates instantiated into C code with the query's attribute types and arithmetical & logical expressions
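
To make the code-generation idea concrete, here is a minimal sketch, under assumed column names and layout, of what a Filter work-unit instantiated for a simple predicate might look like; it is not DataPath's actual template output.

    // Minimal sketch of a generated Filter work-unit for the (hypothetical)
    // predicate: duration > 100 AND adRevenue < 10.0. The code generator would
    // splice the column types and the expression into a template like this.
    #include <cstddef>
    #include <cstdint>

    void FilterWorkUnit(const int32_t* duration, const float* adRevenue,
                        bool* selected, std::size_t numRows) {
        for (std::size_t i = 0; i < numRows; ++i) {
            // Keep a row only if it was selected so far and passes the predicate.
            selected[i] = selected[i] &&
                          (duration[i] > 100) && (adRevenue[i] < 10.0f);
        }
    }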

Architecture
[Diagram: a query enters the Controller, which configures a path graph of waypoints (Selection, Join, Selection); the Storage Manager produces chunks that flow along the path graph and are processed by worker threads drawn from the thread pool.]

Storage Manager
- Table (relation) layout: horizontal partitioning into chunks
- Chunk layout: vertical partitioning into columns; limit on the number of rows (2 million)
- Column layout: fixed page size (512KB)
  - Fixed-size types: array of data
  - Variable-size types: array of indexes + data
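
A minimal sketch of how this chunk and column layout could be represented is shown below; the type names and the example columns are assumptions, not DataPath's actual storage classes.

    // Illustrative sketch of the storage layout described above; the names and
    // the example columns are assumptions, not DataPath's actual classes.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    constexpr std::size_t kPageSize     = 512 * 1024;  // 512KB pages
    constexpr std::size_t kMaxChunkRows = 2000000;     // row limit per chunk

    // Fixed-size column: a plain array of values, stored in kPageSize pages.
    template <typename T>
    struct FixedColumn {
        std::vector<T> data;               // one entry per row
    };

    // Variable-size column: an array of indexes into a shared byte buffer.
    struct VariableColumn {
        std::vector<uint32_t> indexes;     // indexes[i]..indexes[i+1] delimit row i
        std::vector<char>     bytes;       // concatenated values
    };

    // A chunk is a horizontal partition of the table, itself split vertically
    // into columns; it holds at most kMaxChunkRows rows.
    struct Chunk {
        FixedColumn<int32_t> visitDate;    // example fixed-size column
        FixedColumn<float>   adRevenue;    // example fixed-size column
        VariableColumn       destURL;      // example variable-size column
        std::size_t numRows = 0;
    };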

Waypoints & Work-units
- Waypoint: controller for a specific operation (Selection, Join, etc.)
  - State
  - Tasks = work-units
- Work-unit: code generated & loaded at runtime
- Selection waypoint: no state; work-units: Filter
- Join waypoint: state: hash table; work-units: BuildHash, MergeHash, Probe

Query Execution
When a new query arrives in the system, the code to be executed is first generated, then compiled and loaded into the execution engine; essentially, the waypoints are configured with the code (work-units) to execute for the new query. Once the storage manager starts to produce chunks for the new query, they are routed to waypoints based on the query execution plan (path graph). If there are available threads in the system, a chunk and the work-unit selected by its waypoint are sent to a worker thread for execution. When the worker thread finishes a work-unit, the thread is returned to the pool and the chunk is routed to the next waypoint.
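
A simplified sketch of this chunk-routing loop is given below; the Waypoint, ThreadPool, and Chunk types, and the way work-units are represented, are assumptions made for illustration rather than DataPath's actual interfaces.

    // Simplified sketch of the chunk-routing loop; the Waypoint, ThreadPool,
    // and Chunk types are assumptions for illustration, not DataPath's code.
    #include <functional>

    struct Chunk;                                   // opaque columnar chunk

    struct Waypoint {
        virtual ~Waypoint() = default;
        // Selects the generated work-unit to run on a chunk for this query.
        virtual std::function<void(Chunk&)> SelectWorkUnit() = 0;
        // Next stop in the path graph (nullptr at the end of the plan).
        virtual Waypoint* Next(Chunk&) = 0;
    };

    struct ThreadPool {
        // Stand-in: run the task synchronously; a real pool would hand it to
        // one of its worker threads.
        void Submit(const std::function<void()>& task) { task(); }
    };

    // Route a chunk through the path graph: pair it with the work-unit chosen
    // by its current waypoint, execute on a worker thread, then move on.
    void RouteChunk(Chunk& chunk, Waypoint* wp, ThreadPool& pool) {
        if (wp == nullptr) return;                  // reached the end of the plan
        auto workUnit = wp->SelectWorkUnit();
        pool.Submit([&chunk, wp, workUnit, &pool]() {
            workUnit(chunk);                           // execute the work-unit
            RouteChunk(chunk, wp->Next(chunk), pool);  // next waypoint
        });
    }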

Outline (recap): DataPath Query Execution Engine (Storage Manager, Waypoints & Work-units); Generalized Linear Aggregates (GLA); GLADE (Shared-Memory, Shared-Nothing); GLA vs. Map-Reduce

User-Defined Aggregates (UDA) [Wang and Zaniolo 2000, Cohen 2006]

SQL-style UDA definition:

    AGGREGATE Avg(Next Int) : Real {
        TABLE state(tsum Int, cnt Int);
        INITIALIZE : {
            INSERT INTO state VALUES (Next, 1);
        }
        ITERATE : {
            UPDATE state SET tsum = tsum + Next, cnt = cnt + 1;
        }
        TERMINATE : {
            INSERT INTO RETURN SELECT tsum / cnt FROM state;
        }
    }

Object-oriented equivalent:

    public class Avg {
        int sum;
        int count;
        public void Init() { sum = 0; count = 0; }
        public void Accumulate(int newval) { sum = sum + newval; count = count + 1; }
        public void Merge(Avg other) { sum = sum + other.sum; count = count + other.count; }
        public double Terminate() { return (double)sum / count; }
    }

Generalized Linear Aggregates (GLA)
- Aggregates computed through (multiple) sequential passes over the data
- State
- UDA interface: Init, Accumulate, Merge, Terminate, plus Serialize/Deserialize
- Defined and invoked in C++
- Direct access to state

    public class Avg {
        int sum;
        int count;
        public void Init() { sum = 0; count = 0; }
        public void Accumulate(int newval) { sum = sum + newval; count = count + 1; }
        public void Merge(Avg other) { sum = sum + other.sum; count = count + other.count; }
        public double Terminate() { return (double)sum / count; }
        ARCHIVER_SIMPLE_DEFINITION()
    }
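
Because the slide's example is written in Java-like pseudocode while GLAs are defined and invoked in C++, the following is a minimal C++ sketch of the same Average aggregate under the stated interface (Init, Accumulate, Merge, Terminate, Serialize/Deserialize); the exact signatures and the serialization format are assumptions, not GLADE's actual API.

    // Minimal C++ sketch of the Average GLA under the interface described
    // above; signatures and serialization format are illustrative assumptions.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    class AverageGLA {
    public:
        void Init() { sum = 0; count = 0; }

        // Called once per tuple during the sequential pass over a chunk.
        void Accumulate(int32_t value) { sum += value; ++count; }

        // Combines two partial states (e.g., states built by different threads).
        void Merge(const AverageGLA& other) {
            sum += other.sum;
            count += other.count;
        }

        // Produces the final result after all states have been merged.
        double Terminate() const {
            return count ? static_cast<double>(sum) / count : 0.0;
        }

        // Serialize/Deserialize let the state be shipped between machines.
        void Serialize(std::vector<char>& out) const {
            out.resize(sizeof(sum) + sizeof(count));
            std::memcpy(out.data(), &sum, sizeof(sum));
            std::memcpy(out.data() + sizeof(sum), &count, sizeof(count));
        }
        void Deserialize(const std::vector<char>& in) {
            std::memcpy(&sum, in.data(), sizeof(sum));
            std::memcpy(&count, in.data() + sizeof(sum), sizeof(count));
        }

    private:
        int64_t sum = 0;
        int64_t count = 0;
    };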

Outline (recap): DataPath Query Execution Engine (Storage Manager, Waypoints & Work-units); Generalized Linear Aggregates (GLA); GLADE (Shared-Memory, Shared-Nothing); GLA vs. Map-Reduce

GLADE = GLA Distributed Engine
- The user defines a GLA
- The user invokes the GLA with data on DataPath
  - The GLA executes right near the data
  - Takes advantage of full parallelism: a single machine (GLADE Shared-Memory) or multiple machines (GLADE Shared-Nothing)
- The user gets back the computed GLA
- Hadoop + Map-Reduce :: DataPath + GLA

GLADE Shared-Memory
GLA Waypoint
- State: a list of GLA states
- Work-units: Accumulate, Merge

When a chunk needs to be processed, the GLA Waypoint extracts a GLA state from the list and passes it together with the chunk to the Accumulate work-unit. The task executed by the work-unit is to call Accumulate for each tuple, so that the GLA is updated with all the tuples in the chunk. If no GLA state is passed with the work-unit, a new GLA is created and initialized (Init) inside the work-unit, so that a GLA is always sent back to the GLA Waypoint. When all the chunks are processed, the list of GLA states has to be merged. Notice that the maximum number of GLA states that can be created is bounded by the number of threads in the pool. The merging of two GLA states is done by the Merge work-unit, which calls Merge on the two states.
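
The per-chunk logic above maps naturally to a small amount of code. The following is a minimal sketch, assuming a GLA type with the interface from the previous section and a Chunk that can be iterated tuple by tuple; the class and method names are illustrative, not GLADE's actual implementation.

    // Sketch of the shared-memory GLA Waypoint; names and structure are
    // assumptions for illustration. States are reused from a list, so at most
    // one state per worker thread is ever created.
    #include <mutex>
    #include <utility>
    #include <vector>

    template <typename GLA, typename Chunk>
    class GLAWaypoint {
    public:
        // Accumulate work-unit: fold one chunk into a GLA state taken from the
        // list (or a freshly initialized one), then return the state to the list.
        void ProcessChunk(const Chunk& chunk) {
            GLA state = TakeOrCreateState();
            for (const auto& tuple : chunk)          // assumes Chunk is iterable
                state.Accumulate(tuple);
            std::lock_guard<std::mutex> lock(mtx);
            states.push_back(std::move(state));
        }

        // Merge work-units: called after all chunks are processed (single thread),
        // folds the list of partial states into one final state.
        GLA MergeAll() {
            GLA result = TakeOrCreateState();
            while (!states.empty()) {
                result.Merge(states.back());
                states.pop_back();
            }
            return result;
        }

    private:
        GLA TakeOrCreateState() {
            std::lock_guard<std::mutex> lock(mtx);
            if (!states.empty()) {
                GLA s = std::move(states.back());
                states.pop_back();
                return s;
            }
            GLA s;
            s.Init();                                // new state: initialize it
            return s;
        }

        std::mutex mtx;
        std::vector<GLA> states;
    };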

Experimental Data [Pavlo et al. 2009]

    CREATE TABLE UserVisits (
        sourceIP      VARCHAR(16),   // internally INT (4 bytes)
        destURL       VARCHAR(100),
        visitDate     DATE,          // internally INT (4 bytes)
        adRevenue     FLOAT,
        userAgent     VARCHAR(64),
        countryCode   VARCHAR(3),
        languageCode  VARCHAR(6),
        searchWord    VARCHAR(32),
        duration      INT);

155 million tuples, approx. 20GB

Experimental Tasks
- Average: computes the average time a user spends on a web page
      SELECT AVG(duration) FROM UserVisits
- Group By: computes the ad revenue generated by each user across all the visited web pages
      SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP
- Top-K: determines the users who generated the largest one hundred (top-100) ad revenues on a single visit
      SELECT TOP 100 sourceIP, adRevenue FROM UserVisits ORDER BY adRevenue DESC
- K-Means: calculates the five most representative (5 centers) ad revenues
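
To illustrate how a task beyond a simple scalar aggregate fits the same interface, here is a hedged C++ sketch of a Group-By GLA for the second task (ad revenue per sourceIP); the hash-map state and the Accumulate signature are assumptions, not GLADE's implementation.

    // Sketch of a Group-By GLA (sum of adRevenue per sourceIP). The state is a
    // hash map; the tuple representation is an illustrative assumption.
    #include <string>
    #include <unordered_map>

    class GroupByGLA {
    public:
        void Init() { groups.clear(); }

        // One tuple = (sourceIP, adRevenue).
        void Accumulate(const std::string& sourceIP, float adRevenue) {
            groups[sourceIP] += adRevenue;
        }

        // Merge another partial state: add its per-group sums into this one.
        void Merge(const GroupByGLA& other) {
            for (const auto& [key, revenue] : other.groups)
                groups[key] += revenue;
        }

        // Terminate: the final state is the per-user ad revenue.
        const std::unordered_map<std::string, double>& Terminate() const {
            return groups;
        }

    private:
        std::unordered_map<std::string, double> groups;
    };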

Experimental Setup (GLADE Shared-Memory)
- Mid-level server
  - CPU: 48 AMD Opteron cores @ 1.9GHz
  - Memory: 256GB
  - Disks: 76 hard-disks on 3 RAID controllers, 50MB/s bandwidth per disk
  - OS: Ubuntu 10.04 SMP Server, kernel 2.6.32-26 (64-bit)
- Data: one UserVisits instance (20GB) per disk, maximum 1.3TB in total

Experimental Results (GLADE Shared-Memory)
[Charts: execution time (seconds) vs. number of disks (1 to 64) for the four tasks: Average, K-Means (average time per iteration), Top-K, and Group By.]

GLADE Shared-Nothing
[Diagram: an aggregation tree over three nodes. Every node builds its local GLA from chunks (1. Init(), 2. Accumulate(t), 3. Merge(GLA)); leaf nodes (Node 1, Node 2) serialize their GLAs (4. Serialize()) and send them to their parent; the root (Node 0) deserializes them (4. Deserialize()), merges them into its own GLA (5. Merge(GLA)), and calls 6. Terminate().]

Query Execution
The coordinator generates the code to be executed at each waypoint in the DataPath execution plan. A single execution plan is used for all the workers. The coordinator creates an aggregation tree connecting all the workers; the tree is used for in-network aggregation of the GLAs. The execution plan, the code, and the aggregation tree information are broadcast to all the workers.

Once a worker configures itself with the execution plan and loads the code, it starts to compute the GLA for its local data. This happens in exactly the same manner as for GLADE Shared-Memory. When a worker completes the computation of the local GLA, it first communicates this to the coordinator; the coordinator uses this information to monitor how the execution evolves. If the worker is a leaf, it immediately sends the serialized GLA to its parent in the aggregation tree. A non-leaf node has one more step to execute: it needs to aggregate the local GLA with the GLAs of its children. For this, it first deserializes the external GLAs and then executes another round of Merge work-units. In the end, it sends the combined GLA to its parent. The worker at the root of the aggregation tree calls Terminate before sending the final GLA to the coordinator, which passes it further to the client that submitted the job.
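
The non-leaf step described above amounts to deserializing the children's states and merging them into the local one before forwarding the result; a minimal sketch, assuming the serialized child states have already been received and that the GLA type follows the interface from earlier:

    // Sketch of the extra step at a non-leaf worker: deserialize each child's
    // GLA and merge it into the local state. The GLA type and the byte-buffer
    // representation of serialized states are assumptions.
    #include <vector>

    template <typename GLA>
    GLA CombineWithChildren(GLA localState,
                            const std::vector<std::vector<char>>& childStates) {
        for (const auto& blob : childStates) {
            GLA child;
            child.Deserialize(blob);    // reconstruct the child's state
            localState.Merge(child);    // another round of Merge work-units
        }
        return localState;              // serialized and sent to the parent,
                                        // or Terminate()d at the root
    }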

Experimental Setup (GLADE Shared-Nothing)
- 17-node cluster: 1 coordinator node, 16 worker nodes
  - CPU: 4 AMD Opteron cores @ 2.4GHz
  - Memory: 4GB
  - Disk: 1 hard-disk @ 50MB/s bandwidth
  - Network: 1Gb/s (125MB/s) switch
  - OS: Ubuntu 7.04 Server, kernel 2.6.20-16 (32-bit)
- Data: one UserVisits instance (20GB) per node, maximum 320GB in total

Experimental Results (GLADE Shared-Nothing)
[Charts: execution time (seconds) vs. number of machines (1 to 16) for the four tasks: Average, K-Means (average time per iteration), Top-K, and Group By.]

Outline (recap): DataPath Query Execution Engine (Storage Manager, Waypoints & Work-units); Generalized Linear Aggregates (GLA); GLADE (Shared-Memory, Shared-Nothing); GLA vs. Map-Reduce

GLA vs. Map-Reduce

GLA:
- Well-defined state
- Interface: Init, Accumulate, Merge, Terminate
- Turing complete
- Runtime executes only user code
- Only point-to-point communication
- More intuitive: algorithms for complex analytics are already defined

Map-Reduce:
- No state (only key-value pairs)
- Interface: Map, Combine, Reduce
- Turing complete
- Runtime executes sorting & grouping
- All-to-all communication
- Aggregate computation difficult to express: extra merging job, single reducer, extra communication

Hadoop vs. GLADE
[Charts: execution time (seconds) vs. number of machines (1 to 16) for Hadoop and GLADE on the Read and Group By tasks.]

Conclusions and Future Work
- GLA: express complex aggregates through an intuitive interface and user code
- GLADE: architecture-independent, efficient (I/O bound), and scalable
- Future work:
  - Library of template aggregates
  - GLA extension for approximate query processing (online aggregation)
  - Fault-tolerance
  - Bulk-loading

Collaborators: Alin Dobra, University of Florida
Questions?