Workload-Aware Data Partitioning in CommunityDriven Data Grids

Similar documents
Scalable Community-Driven Data Sharing in e-science Grids

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Distributed Systems. 05r. Case study: Google Cluster Architecture. Paul Krzyzanowski. Rutgers University. Fall 2016

Finding a needle in Haystack: Facebook's photo storage

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets

SDSS Dataset and SkyServer Workloads

Parallel Databases C H A P T E R18. Practice Exercises

Massive Scalability With InterSystems IRIS Data Platform

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

CLUSTERING HIVEMQ. Building highly available, horizontally scalable MQTT Broker Clusters

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

<Insert Picture Here> Oracle Coherence & Extreme Transaction Processing (XTP)

INSPIRE and Service Level Management Why it matters and how to implement it

Databases in the Cloud

Chapter 20: Database System Architectures

Mitigating Data Skew Using Map Reduce Application

The Virtual Observatory and the IVOA

Lecture 23 Database System Architectures

HYBRID TRANSACTION/ANALYTICAL PROCESSING COLIN MACNAUGHTON

Distributed Meta-data Servers: Architecture and Design. Sarah Sharafkandi David H.C. Du DISC

Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

Be Fast, Cheap and in Control with SwitchKV Xiaozhou Li

CHAPTER 7 CONCLUSION AND FUTURE SCOPE

Chapter 18: Database System Architectures.! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems!

Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France

Assignment 5. Georgia Koloniari

Datacenter replication solution with quasardb

Introduction to the Active Everywhere Database

The Google File System

Sinbad. Leveraging Endpoint Flexibility in Data-Intensive Clusters. Mosharaf Chowdhury, Srikanth Kandula, Ion Stoica. UC Berkeley

RBF: A New Storage Structure for Space- Efficient Queries for Multidimensional Metadata in OSS

Take Back Lost Revenue by Activating Virtuozzo Storage Today

Peta-Scale Simulations with the HPC Software Framework walberla:

Chapter 24 NOSQL Databases and Big Data Storage Systems

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

April 21, 2017 Revision GridDB Reliability and Robustness

Facebook Tao Distributed Data Store for the Social Graph

Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017

Strategic Briefing Paper Big Data

An Adaptive Online System for Efficient Processing of Hierarchical Data

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Oracle NoSQL Database Overview Marie-Anne Neimat, VP Development

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

Advanced Databases: Parallel Databases A.Poulovassilis

Distributed KIDS Labs 1

VIRTUAL OBSERVATORY TECHNOLOGIES

Massively Parallel Processing. Big Data Really Fast. A Proven In-Memory Analytical Processing Platform for Big Data

Storage and Compute Resource Management via DYRE, 3DcacheGrid, and CompuStore Ioan Raicu, Ian Foster

SURVEY ON LOAD BALANCING AND DATA SKEW MITIGATION IN MAPREDUCE APPLICATIONS

Towards Scalable Data Management for Map-Reduce-based Data-Intensive Applications on Cloud and Hybrid Infrastructures

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

Storage on the Lunatic Fringe. Thomas M. Ruwart University of Minnesota Digital Technology Center Intelligent Storage Consortium

Tools for Social Networking Infrastructures

When, Where & Why to Use NoSQL?

Hybrid Data Platform

Using Statistics for Computing Joins with MapReduce

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Visualization-supported Analysis of System Data for Controlled VMI-based Intrusion Detection

Dynamic Metadata Management for Petabyte-scale File Systems

e-infrastructures in FP7 INFO DAY - Paris

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

MOHA: Many-Task Computing Framework on Hadoop

SCHISM: A WORKLOAD-DRIVEN APPROACH TO DATABASE REPLICATION AND PARTITIONING

April 2010 Rosen Shingle Creek Resort Orlando, Florida

The Intersection of Cloud & Solid State Storage

HYRISE In-Memory Storage Engine

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers

Decentralized Distributed Storage System for Big Data

Use Cases for Partitioning. Bill Karwin Percona, Inc

DDN About Us Solving Large Enterprise and Web Scale Challenges

Fujitsu/Fujitsu Labs Technologies for Big Data in Cloud and Business Opportunities

Recent Advances in Analytical Modeling of SSD Garbage Collection

RAIDIX Data Storage Solution. Clustered Data Storage Based on the RAIDIX Software and GPFS File System

2014 年 3 月 13 日星期四. From Big Data to Big Value Infrastructure Needs and Huawei Best Practice

Authors : Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung Presentation by: Vijay Kumar Chalasani

Oracle and Tangosol Acquisition Announcement

EMC ISILON HARDWARE PLATFORM

ECMWF's Next Generation IO for the IFS Model and Product Generation

Storage Optimization with Oracle Database 11g

data parallelism Chris Olston Yahoo! Research

So you think you know everything about Partitioning?

From data center OS to Cloud architectures The future is Open Syed M Shaaf

Oracle Database 10G. Lindsey M. Pickle, Jr. Senior Solution Specialist Database Technologies Oracle Corporation

Huge market -- essentially all high performance databases work this way

Matthias Wobben working in Berlin, Germany. Senior Sales Engineer at Nextcloud

A Data Diffusion Approach to Large Scale Scientific Exploration

context: massive systems

Cisco SAN Analytics and SAN Telemetry Streaming

Jingren Zhou. Microsoft Corp.

Experiment-Driven Evaluation of Cloud-based Distributed Systems

Migrating to MySQL. Ted Wennmark, consultant and cluster specialist. Copyright 2014, Oracle and/or its its affiliates. All All rights reserved.

HANDLING DATA SKEW IN MAPREDUCE

Native Support of Multi-tenancy in RDBMS for Software as a Service

Functional Testing of SQL Server on Kaminario K2 Storage

SAT, SMT and QBF Solving in a Multi-Core Environment

Adaptive Query Processing on Prefix Trees Wolfgang Lehner

Architekturen für die Cloud

Dr. Angelika Reiser Chair for Database Systems (I3)

Transcription:

Workload-Aware Data Partitioning in CommunityDriven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper Department of Computer Science, Germany

? e t Many challenges and opportunities in e-science for database a c i l research p e High-throughput data managementr r o Correlation of distributedidata sources t l p Community-driven data grids S I Dealing lwith d data skew and query hot spots u o h Workload-awareness by employing cost model during S partitioning 2

Query Load Balancing via Partitioning 3

Query Load Balancing via Partitioning 4

Query Load Balancing via Partitioning 5

Query Load Balancing via Partitioning X 6

Query Load Balancing via Replication 7

Query Load Balancing via Replication 8

Query Load Balancing via Replication 9

The AstroGrid-D Project German Astronomy Community Grid http://www.gac-grid.org/ Funded by the German Ministry of Education and Research Part of D-Grid 10

Up-Coming Data-Intensive Applications Alex Szalay, Jim Gray (Nature, 2006): Science in an exponential world Data rates LHC Terabytes a day/night Petabytes a year LHC LSST LOFAR Pan-STARRS LOFAR 11

The Multiwavelength Milky Way http://adc.gsfc.nasa.gov/mw/ 12

Research Challenges Directly deal with Terabyte/Petabyte-scale data sets Integrate with existing community infrastructures High throughput for growing user communities 13

Current Sharing in Data Grids Data autonomy Policies allow partners to access data Each institution ensures Availability (replication) Scalability Various organizational structures [Venugopal et al. 2006]: Centralized Hierarchical Federated Hybrid 14

Community-Driven Data Grids (HiSbase) 15

Distribute by Region not by Archive! 16

Distribute by Region not by Archive! 17

Distribute by Region not by Archive! 18

Distribute by Region not by Archive! 19

Mapping Data to Nodes 20

Workload-Aware Training Phase Incorporate query traces during training phase Base partitioning scheme on Data load Query load Challenges Balance query load without losing data load balancing Approximate real query hot spots from query sample 21

Dealing with Query Hot Spots Query skew triggered by increased interest in particular subsets of the data Two well-known query load balancing techniques: Data partitioning Data replication Finding trade-offs between both 22

When to Split (Partition) or to Replicate Considers partition characteristics Amount of data (few/many data points) Number of queries (few/many queries) Extent of regions and queries (small/big queries) Data points Few Queries Many Queries Small Big Small Big Few SPLIT REPLICATE Many SPLIT SPLIT SPLIT REPLICATE 23

Region Weight Functions Data only (#objects in a region) Queries only (#queries in a region) Scaled queries Approximate real extent of hot spot Avoid overfitting to training query set Heat of a region (#objects * #queries) Extents of regions and queries Replicate when many big queries big small 24

Evaluation Weight functions: data, heat, extent Data sets (observational, simulation) Workloads (SDSS query log, synthetic) Partitioning Scheme Properties Load distribution Communication overhead Throughput Measurements Distributed setup FreePastry simulator Pobs 25

Load Distribution Uniform data set from the Millennium simulation Workload with extreme hot spot In the following: 1024 partitions Heat of a region (#data * #queries) Normalized across all partitioning schemes 26

Query-unaware Training 27

Training with Scaled Queries (scaled 50x) 28

Training with Scaled Queries (scaled 400x) 29

Heat-based, Extent-based Training 30

Communication Overhead for Pobs 31

Throughput for Pobs 32

Load Balancing During Runtime Complement workload-aware partitioning with runtime loadbalancing Short-term peaks Master-slave approach Load monitoring Long-term trends Based on load monitoring Histogram evolution 33

Related Work On-line load balancing Hundreds of thousands to millions of nodes Reacting fast Treating objects individually HiSbase 34

Should I Split or Replicate? Many challenges and opportunities in e-science for database research High-throughput data management Correlation of distributed data sources Community-driven data grids Dealing with data skew and query hot spots Workload-awareness by employing cost model during partitioning 35

Get in Touch Database systems group, TU München Web site: http://www-db.in.tum.de E-mail: scholl@in.tum.de The HiSbase project http://www-db.in.tum.de/research/projects/hisbase/ Thank You for Your Attention 36

Queries Intersecting Multiple Regions 37

Regions Without Queries 38

Throughput for Pobs (300 nodes, sim.) 39

Throughput for Pobs (1000 nodes, sim.) 40

Throughput (Region-Uniform Queries) 41