Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu, University of Helsinki, Finland


Big Data: 4Vs (Photo downloaded from: https://blog.infodiagram.com/2014/04/visualizing-big-data-concepts-strong.html)

Hadoop and Spark platform optimization

Multi-model databases: quantum framework and category theory

Outline
- Big data platform optimization (10 mins): motivation; two main principles and approaches; experimental results
- Multi-model databases (10 mins): overview; quantum framework; category theory

Optimizing parameters in Hadoop and Spark platforms

Analysis in the Big Data Era
Data analysis turns massive data into decision making, insights, and savings and revenue. Key to success = timely and cost-effective analysis.

Analysis in the Big Data Era
Popular big data platforms: Hadoop and Spark.
Burden on users: they are responsible for provisioning and configuration, yet usually lack the expertise to tune the system. ("As a data scientist, I do not know how to improve the efficiency of my job.")
Effect of system tuning (tuned vs. default): running time often improves by 10x, as does system resource utilization. Well-tuned jobs may also avoid failures such as OOM, running out of disk, and job timeouts.

Automatic job optimization toolkit
NOT our goal: changing the system's code to improve efficiency.
Our goal: configuring the parameters to achieve good performance. Our toolkit is easy to use on existing Hadoop and Spark systems.

Problem definition: given a MapReduce or Spark job with its input data and running cluster, find the parameter setting that minimizes the job's execution time.

Challenges: too many parameters! There are more than 190 parameters in Hadoop.

Two key ideas in the job optimizer
1. Reduce the search space: from more than 190 Hadoop parameters, to the 13 parameters we tune, down to 4 key factors.

The 13 parameters we tune

Four key factors
We identify four key factors (parameters) to model a MapReduce job execution:
- The number of Map task waves m (i.e., the number of Map tasks)
- The number of Reduce task waves r (i.e., the number of Reduce tasks)
- The Map output compression option c (true or false)
- The copy speed in the Shuffle phase v (the number of parallel copiers)
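For concreteness, the four factors line up with standard Hadoop 2.x configuration keys. The sketch below renders them as "-D key=value" flags; the mapping is illustrative (the slides do not show which keys MRTuner actually sets), and the number of Map tasks in particular is controlled only indirectly, through the input split size:

```python
# Illustrative mapping of the four key factors (m, r, c, v) onto
# Hadoop 2.x configuration keys; not necessarily the keys MRTuner uses.

def to_hadoop_flags(split_max_bytes, num_reducers, compress_map_output, parallel_copiers):
    """Render the four factors as '-D key=value' command-line flags."""
    conf = {
        # m: the number of Map tasks is controlled indirectly via split size
        "mapreduce.input.fileinputformat.split.maxsize": str(split_max_bytes),
        # r: the number of Reduce tasks
        "mapreduce.job.reduces": str(num_reducers),
        # c: Map output compression on/off
        "mapreduce.map.output.compress": str(compress_map_output).lower(),
        # v: parallel copier threads in the Shuffle phase
        "mapreduce.reduce.shuffle.parallelcopies": str(parallel_copiers),
    }
    return ["-D{}={}".format(k, v) for k, v in sorted(conf.items())]

flags = to_hadoop_flags(256 * 1024 * 1024, 32, True, 10)
```

Such flags could then be passed to a `hadoop jar` invocation when launching the tuned job.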

Cost model
- Producer: the time to produce the Map outputs in m waves
- Transporter: the non-overlapped time to transport Map outputs to the Reduce side
- Consumer: the time to produce the Reduce outputs in r waves
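The producer/transporter/consumer split can be sketched as a simple cost function. This is an illustrative shape only, not the paper's exact formulas: shuffle transport overlaps the Map phase, so only the part of it that outlasts the Map waves adds to the critical path.

```python
# Illustrative producer/transporter/consumer cost model (not MRTuner's
# exact formulas): job time = producer + non-overlapped transport + consumer.

def estimate_job_time(m, r, map_wave_time, reduce_wave_time,
                      shuffle_bytes, net_throughput):
    """Estimate MapReduce job time.

    m, r             -- number of Map / Reduce task waves
    map_wave_time    -- duration of one Map wave (seconds)
    reduce_wave_time -- duration of one Reduce wave (seconds)
    shuffle_bytes    -- total Map output shipped to the Reduce side
    net_throughput   -- shuffle copy throughput (bytes/second)
    """
    producer = m * map_wave_time
    transport = shuffle_bytes / net_throughput
    # Shuffle copying runs in parallel with the Map phase; only the
    # portion that outlasts the Map waves delays the job.
    non_overlapped = max(0.0, transport - producer)
    consumer = r * reduce_wave_time
    return producer + non_overlapped + consumer
```

This also shows why the compression option c matters: compressing Map output shrinks `shuffle_bytes` at the cost of CPU time inside the Map waves.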

Two key ideas in the job optimizer
2. Keep everything busy!
- CPU: map, reduce, and compression
- I/O: sort and merge
- Network: shuffle

Keep map and shuffle parallel

MRTuner approach: given a new job to tune,
1. Retrieve the profile data
2. Search for the four key parameters
3. Compute the 13 parameters
4. Configure and run
Result: good performance after tuning.
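Step 2 can be sketched as an exhaustive search over a small grid of the four factors, scored by a cost model. The real MRTuner search and cost model are more elaborate; this only shows the shape, with a toy cost function standing in:

```python
# Sketch of the search step: score every (m, r, c, v) combination with a
# cost model and keep the cheapest. The toy cost function is hypothetical.
import itertools

def search_best_setting(cost_fn, m_opts, r_opts, c_opts, v_opts):
    """Return the (m, r, c, v) tuple minimizing cost_fn."""
    return min(itertools.product(m_opts, r_opts, c_opts, v_opts),
               key=lambda s: cost_fn(*s))

def toy_cost(m, r, c, v):
    # Compression shrinks the shuffle volume; more copiers speed it up.
    shuffle = (400.0 if c else 1000.0) / v
    return 100.0 * m + shuffle + 80.0 * r

best = search_best_setting(toy_cost, [1, 2], [1, 2], [False, True], [5, 10])
```

Four factors keep the grid tiny, which is the point of reducing the search space before search; the 13 concrete parameters are then derived from the winning setting.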

Architecture: the Job Optimizer runs online, answering tuning requests through a profile query engine; offline, the Job Profile, Data Profile, and Resource Profile are built from Hadoop MR logs, HDFS, and OS/Ganglia metrics.

Profile data
- Job profile: selectivity of Map input/output; selectivity of Reduce input/output; ratio of Map output compression
- Data profile: data size; distribution of input keys/values
- System profile: number of machines; network throughput; compression/decompression throughput; overhead to create a Map or Reduce task
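The three profile kinds can be represented as plain data records; the field names below are illustrative, taken directly from the bullet list above:

```python
# Sketch of the three profile kinds as plain records (field names are
# illustrative, mirroring the slide's bullets).
from dataclasses import dataclass

@dataclass
class JobProfile:
    map_selectivity: float       # Map output size / Map input size
    reduce_selectivity: float    # Reduce output size / Reduce input size
    map_compress_ratio: float    # compressed / uncompressed Map output

@dataclass
class DataProfile:
    data_size_bytes: int
    key_distribution: dict       # input key -> frequency

@dataclass
class SystemProfile:
    num_machines: int
    net_throughput: float        # bytes/second
    compress_throughput: float   # bytes/second
    task_startup_overhead: float # seconds per Map or Reduce task
```

A cost model consumes all three: job and data profiles predict the data volumes, while the system profile supplies the throughputs.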

Experimental evaluation
Performance comparison:
- Hadoop-X: a commercial Hadoop distribution
- Starfish: parameters advised by Starfish
- MRTuner: parameters advised by MRTuner
Workloads: Terasort, N-gram, PageRank

Effectiveness of the MRTuner job optimizer. Comparing job running times against the commercial Hadoop-X: for the N-gram job, MRTuner obtains more than a 20x speedup over Hadoop-X.

Comparison between Hadoop-X and MRTuner (N-gram): cluster-wide resource usage from Ganglia shows that MRTuner's CPU and network utilizations are higher.

Impact of parameters on selected jobs: compression, the number of Map tasks, and the number of Reduce tasks are all important. Method: let t1 be the MRTuner-tuned running time and t2 the running time after changing one parameter back to the Hadoop-X setting; the impact is (t2 - t1) / t1, and all impacts are then normalized.
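The impact metric is a one-liner. The sketch below normalizes the raw impacts so they sum to 1; that is one plausible normalization, since the slides do not say which is used:

```python
# Impact of reverting one parameter: raw impact = (t2 - t1) / t1, where
# t1 is the MRTuner-tuned time and t2 the time with that parameter set
# back to the Hadoop-X value. Normalization to sum 1 is an assumption.

def parameter_impacts(t1, reverted_times):
    """Map each parameter name to its normalized impact."""
    raw = {p: (t2 - t1) / t1 for p, t2 in reverted_times.items()}
    total = sum(raw.values())
    return {p: v / total for p, v in raw.items()}

impacts = parameter_impacts(
    100.0,  # hypothetical tuned running time, seconds
    {"compression": 300.0, "map_tasks": 200.0, "reduce_tasks": 150.0},
)
```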

Ongoing research topics: efficient job optimization on YARN and Spark; tuning the container size and executor size.

Multi-model databases: Quantum framework and category theory

Multi-model databases

Motivation: one application includes multi-model data. An e-commerce example with multi-model data: sales history, recommendations, shopping cart, customers, and the product catalog.

NoSQL database types (Photo downloaded from: http://www.vikramtakkar.com/2015/12/nosql-types-of-nosql-database-part-2.html)

Multiple NoSQL databases: MongoDB for sales, Redis for the shopping cart, Neo4j for social media, and MongoDB for customers and the catalog.

Multi-model DB: one unified database for multi-model data (XML, JSON, RDF, tabular, spatial, and text).

Challenge: a new theoretical foundation. This calls for a unified model and theory for multi-model data! The theory of relations (150 years old) is not adequate to mathematically describe modern (NoSQL) DBMSs.

Two possible theoretical foundations
- Quantum framework: approximate query processing for the open field in multi-model databases
- Category theory: exact query processing and schema mapping for the closed field in multi-model databases

Quantum framework: a database is built on two components, logic (SQL expressions) and algebra (relational algebra). The quantum framework adds quantum probability and quantum algebra.

Why not classical probability? We apply three quantum rules to multi-model data: quantum superposition, quantum entanglement, and quantum interference.

Unifying multi-model data in Hilbert space: relational, XML, RDF, and graph data are represented in one Hilbert space, and quantum probability is used to answer queries approximately.

What is category theory? It was invented in the early 1940s and has been proposed as a new foundation for mathematics (to replace set theory). A category composes its arrows associatively.
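The associative composition of arrows is easy to make concrete. In this minimal sketch, morphisms are plain Python functions, so composition is function composition, which is associative by construction:

```python
# Minimal sketch of a category's arrows: morphisms are functions,
# composition is function composition, and each object has an identity.

def compose(g, f):
    """g after f: the composite arrow of f followed by g."""
    return lambda x: g(f(x))

identity = lambda x: x

f = lambda x: x + 1      # an arrow A -> B
g = lambda x: x * 2      # an arrow B -> C
h = lambda x: x - 3      # an arrow C -> D

# Associativity: h∘(g∘f) and (h∘g)∘f agree pointwise.
left = compose(h, compose(g, f))
right = compose(compose(h, g), f)
```

In a multi-model setting, the objects would stand for data models (relational, XML, RDF, graph) and the arrows for mappings between them; this toy version only demonstrates the composition laws.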

Unified data model: one unified data model with objects and morphisms, covering relational, XML, RDF, and graph data.

Unified language model: one unified language model with functors, covering XPath, XQuery, SPARQL, and tabular SQL.

Transformation: natural transformations between the multiple languages for multi-model data.

Ongoing research topics: approximate query processing based on the quantum framework; multi-model data integration based on category theory.

Conclusion
1. Parameter tuning is important for big data platforms like Hadoop and Spark.
2. Two new theoretical foundations are emerging for multi-model databases: the quantum framework and category theory.

References
- Jiaheng Lu, Irena Holubová: Multi-model Data Management: What's New and What's Next? EDBT 2017: 602-605.
- Jiaheng Lu: Towards Benchmarking Multi-Model Databases. CIDR 2017.
- Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, Chen Wang: MRTuner: A Toolkit to Enable Holistic Optimization for MapReduce Jobs. PVLDB 7(13): 1319-1330 (2014).
