Analysis in the Big Data Era

Similar documents
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University

A What-if Engine for Cost-based MapReduce Optimization

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

HiTune. Dataflow-Based Performance Analysis for Big Data Cloud

High Performance Computing on MapReduce Programming Framework

Improving the MapReduce Big Data Processing Framework

MixApart: Decoupled Analytics for Shared Storage Systems. Madalin Mihailescu, Gokul Soundararajan, Cristiana Amza University of Toronto and NetApp

Benchmarking Approach for Designing a MapReduce Performance Model

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

RESTORE: REUSING RESULTS OF MAPREDUCE JOBS. Presented by: Ahmed Elbagoury

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

Introduction to Hadoop and MapReduce

Global Journal of Engineering Science and Research Management

Accelerate Big Data Insights

Oracle Big Data Connectors

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Hadoop Online Training

Native-Task Performance Test Report

DON T CRY OVER SPILLED RECORDS Memory elasticity of data-parallel applications and its application to cluster scheduling

Map Reduce Group Meeting

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Cloud Computing & Visualization

I am: Rana Faisal Munir

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Hadoop/MapReduce Computing Paradigm

PLATFORM AND SOFTWARE AS A SERVICE THE MAPREDUCE PROGRAMMING MODEL AND IMPLEMENTATIONS

Big Data Analytics using Apache Hadoop and Spark with Scala

Introduction to BigData, Hadoop:-

BIG DATA COURSE CONTENT

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Programming Systems for Big Data

Hadoop & Big Data Analytics Complete Practical & Real-time Training

The Stratosphere Platform for Big Data Analytics

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

2013 AWS Worldwide Public Sector Summit Washington, D.C.

Microsoft Big Data and Hadoop

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Hadoop An Overview. - Socrates CCDH

Big Data Hadoop Course Content

Expert Lecture plan proposal Hadoop& itsapplication

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

THE DEFINITIVE GUIDE FOR AWS CLOUD EC2 FAMILIES

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Databases 2 (VU) ( / )

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Big Data Hadoop Stack

CS 345A Data Mining. MapReduce

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Shark: SQL and Rich Analytics at Scale. Reynold Xin UC Berkeley

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

MapReduce Design Patterns

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Scientific Workflows and Cloud Computing. Gideon Juve USC Information Sciences Institute

April Copyright 2013 Cloudera Inc. All rights reserved.

MapR Enterprise Hadoop

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

microsoft

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

Developing MapReduce Programs

IV Statistical Modelling of MapReduce Joins

Improving Hadoop MapReduce Performance on Supercomputers with JVM Reuse

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Parallelizing Multiple Group by Query in Shared-nothing Environment: A MapReduce Study Case

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and thevirtual EDW Headline Goes Here

Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

PStorM: Profile Storage and Matching for Feedback-Based Tuning of MapReduce Jobs

An Improved Hadoop Performance Model For Job Estimation And Resource Provisioning In Public Cloud

Prototyping Data Intensive Apps: TrendingTopics.org

Flash Storage Complementing a Data Lake for Real-Time Insight

Tuning the Hive Engine for Big Data Management

Introduction to Data Management CSE 344

An Introduction to Big Data Formats

Chapter 5. The MapReduce Programming Model and Implementation

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

The Fusion Distributed File System

Introduction to Hive Cloudera, Inc.

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

A Distributed Data- Parallel Execu3on Framework in the Kepler Scien3fic Workflow System

Oracle Data Integrator 12c: Integration and Administration

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Innovatus Technologies

CS 345A Data Mining. MapReduce

QLIK INTEGRATION WITH AMAZON REDSHIFT

Big Data Syllabus. Understanding big data and Hadoop. Limitations and Solutions of existing Data Analytics Architecture

Data warehousing on Hadoop. Marek Grzenkowicz Roche Polska

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Statistics Driven Workload Modeling for the Cloud

Transcription:

Analysis in the Big Data Era Massive Data Data Analysis Insight Key to Success = Timely and Cost-Effective Analysis 2

Hadoop MapReduce Ecosystem Popular solution to Big Data Analytics Java / C++ / R / Python Pig Hive Jaql Oozie Elastic MapReduce Hadoop HBase MapReduce Execution Engine Distributed File System 3

Prac;;oners of Big Data Analy;cs Who are the users? Data analysts, statisticians, computational scientists Researchers, developers, testers You! Who performs setup and tuning? The users! Usually lack expertise to tune the system 4

Tuning Challenges Heavy use of programming languages for MapReduce programs (e.g., Java/python) Data loaded/accessed as opaque files Large space of tuning choices Elasticity is wonderful, but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking) Terabyte-scale data cycles 5

Starfish: Self- tuning System Our goal: Provide good performance automatically Java / C++ / R / Python Pig Hive Jaql Oozie Elastic MapReduce Analytics System Hadoop Starfish HBase MapReduce Execution Engine Distributed File System 6

What are the Tuning Problems? Job-level MapReduce configuration Cluster sizing J 1 J 2 Data layout tuning J 3 J 4 Workflow optimization Workload management 7

Starfish s Core Approach to Tuning Optimizers Search through space of tuning choices Job Cluster Data layout Profiler Collects concise summaries of execution Workflow Workload What- if Engine Estimates impact of hypothetical changes on execution 1) if Δ(conf. parameters) then what? 2) if Δ(data properties) then what? 3) if Δ(cluster properties) then what? 8

Starfish Architecture Workload Optimizer Elastisizer Profiler Workflow Optimizer Job Optimizer What-if Engine Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 9

MapReduce Job Execu;on job j = < program p, data d, resources r, configuration c > split 0 map split 2 map reduce out 0 split 1 map split 3 map reduce Out 1 Two Map Waves One Reduce Wave 10

What Controls MR Job Execu;on? job j = < program p, data d, resources r, configuration c > Space of configuration choices: Number of map tasks Number of reduce tasks Partitioning of map outputs to reduce tasks Memory allocation to task-level buffers Multiphase external sorting in the tasks Whether output data from tasks should be compressed Whether combine function should be used 11

Effect of Configura;on SeJngs Rules-of-thumb settings Two-dimensional projection of a multidimensional surface (Word Co-occurrence MapReduce Program) Use defaults or set manually (rules-of-thumb) Rules-of-thumb may not suffice 12

MapReduce Job Tuning in a Nutshell Goal: perf = F( p, d, r, c) c opt = arg min F( p, d, r, c) c S Challenges: p is an arbitrary MapReduce program; c is high-dimensional; Profiler What-if Engine Optimizer Runs p to collect a job profile (concise execution summary) of <p,d 1,r 1,c 1 > Given profile of <p,d 1,r 1,c 1 >, estimates virtual profile for <p,d 2,r 2,c 2 > Enumerates and searches through the optimization space S efficiently 13

Job Profile Concise representation of program execution as a job Records information at the level of task phases Generated by Profiler through measurement or by the What-if Engine through estimation Serialize, Memory map Partition Buffer Sort, split [Combine], [Compress] Merge DFS Read Map Collect Spill Merge 14

Job Profile Fields Dataflow: amount of data flowing through task phases Map output bytes Number of spills Number of records in buffer per spill Costs: execution times at the level of task phases Read phase time in the map task Map phase time in the map task Spill phase time in the map task Dataflow Statistics: statistical information about dataflow Width of input key-value pairs Map selectivity in terms of records Map output compression ratio Cost Statistics: statistical information about resource costs I/O cost for reading from local disk per byte CPU cost for executing the Mapper per record CPU cost for uncompressing the input per byte 15

Genera;ng Profiles by Measurement Goals Have zero overhead when profiling is turned off Require no modifications to Hadoop Support unmodified MapReduce programs written in Java or Hadoop Streaming/Pipes (Python/Ruby/C++) Approach: Dynamic (on-demand) instrumentation Event-condition-action rules are specified (in Java) Leads to run-time instrumentation of Hadoop internals Monitors task phases of MapReduce job execution We currently use Btrace (Hadoop internals are in Java) 16

Genera;ng Profiles by Measurement JVM Enable Profiling JVM split 0 map reduce out 0 raw data ECA rules raw data JVM split 1 map map profile reduce profile raw data job profile Use of Sampling Profile fewer tasks Execute fewer tasks JVM = Java Virtual Machine, ECA = Event-Condition-Action 17

What- if Engine Possibly Hypothetical Job Profile <p, d 1, r 1, c 1 > Input Data Properties <d 2 > Cluster Resources <r 2 > Configuration Settings <c 2 > What-if Engine Job Oracle Virtual Job Profile for <p, d 2, r 2, c 2 > Task Scheduler Simulator Properties of Hypothetical job 18

Virtual Profile Es;ma;on Given profile for job j = <p, d 1, r 1, c 1 > estimate profile for job j' = <p, d 2, r 2, c 2 > Profile for j Dataflow Statistics Cost Statistics Dataflow Costs Input Data d 2 Resources r 2 Relative Black-box Models (Virtual) Profile for j' Cardinality Models Cost Statistics White-box Models Costs Dataflow Statistics Dataflow Configuration c 2 White-box Models 19

Job Op;mizer Job Profile <p, d 1, r 1, c1> Input Data Properties <d 2 > Cluster Resources <r 2 > Just-in-Time Optimizer Subspace Enumeration Recursive Random Search What-if calls Best Configuration Settings <c opt > for <p, d 2, r 2 > 20

Workflow Op;miza;on Space Optimization Space Physical Logical Job-level Configuration Dataset-level Configuration Vertical Packing Partition Function Selection Join Selection Inter-job Inter-job 21

Op;miza;ons on TF- IDF Workflow D0 <{D},{W}> D1 D2 D4 J1 M1 R1 <{D, W},{f}> J2 J3, J4 M2 R2 <{D},{W, f, c}> M3 R3 M4 <{W},{D, t}> D0 <{D},{W}> J1, J2 M1 R1 M2 Logical R2 Optimization D2 <{D},{W, f, c}> J3, J4 D4 M3 R3 M4 <{W},{D, t}> Partition:{D} Sort: {D,W} Physical Optimization Reducers= 50 Compress = off Memory = 400 Reducers= 20 Compress = on Memory = 300 Legend D = docname f = frequency W = word c = count t = TF-IDF 22

New Challenges What-if challenges: Support concurrent job execution Estimate intermediate data properties J1 Workflow J2 Optimization challenges Interactions across jobs Extended optimization space Find good configuration settings for individual jobs J3 J4 23

Cluster Sizing Problem Use-cases for cluster sizing Tuning the cluster size for elastic workloads Workload transitioning from development cluster to production cluster Multi-objective cluster provisioning Goal Determine cluster resources & job-level configuration parameters to meet workload requirements 24

Mul;- objec;ve Cluster Provisioning Cloud enables users to provision clusters in minutes Running Time (min) Cost ($) 1,200 1,000 800 600 400 200 0 10.00 8.00 6.00 4.00 2.00 0.00 m1.small m1.large m1.xlarge c1.medium c1.xlarge m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type 25

Experimental Evalua;on Starfish (versions 0.1, 0.2) to manage Hadoop on EC2 Different scenarios: Cluster Workload Data EC2 Node Type CPU: EC2 units Mem I/O Perf. Cost / hour #Maps /node #Reds /node MaxMem /task m1.small 1 (1 x 1) 1.7 GB moderate $0.085 2 1 300 MB m1.large 4 (2 x 2) 7.5 GB high $0.34 3 2 1024 MB m1.xlarge 8 (4 x 2) 15 GB high $0.68 4 4 1536 MB c1.medium 5 (2 x 2.5) 1.7 GB moderate $0.17 2 2 300 MB c1.xlarge 20 (8 x 2.5) 7 GB high $0.68 8 6 400 MB cc1.4xlarge 33.5 (8) 23 GB very high $1.60 8 6 1536 MB 26

Experimental Evalua;on Starfish (versions 0.1, 0.2) to manage Hadoop on EC2 Different scenarios: Cluster Workload Data Abbr. MapReduce Program Domain Dataset CO Word Co-occurrence Natural Lang Proc. Wikipedia (10GB 22GB) WC WordCount Text Analytics Wikipedia (30GB 1TB) TS TeraSort Business Analytics TeraGen (30GB 1TB) LG LinkGraph Graph Processing Wikipedia (compressed ~6x) JO Join Business Analytics TPC-H (30GB 1TB) TF Term Freq. - Inverse Document Freq. Information Retrieval Wikipedia (30GB 1TB) 27

Job Op;mizer Evalua;on Hadoop cluster: 30 nodes, m1.xlarge Data sizes: 60-180 GB 60 Speedup over Default 50 40 30 20 10 0 TS WC LG JO TF CO MapReduce Programs Default Settings Rule-based Optimizer Cost-based Optimizer 28

Es;mates from the What- if Engine Hadoop cluster: 16 nodes, c1.medium MapReduce Program: Word Co-occurrence Data set: 10 GB Wikipedia True surface Estimated surface 29

Profiling Overhead Vs. Benefit Hadoop cluster: 16 nodes, c1.medium MapReduce Program: Word Co-occurrence Data set: 10 GB Wikipedia Percent Overhead over Job Running Time with Profiling Turned Off 35 30 25 20 15 10 5 0 1 5 10 20 40 60 80 100 Percent of Tasks Profiled Speedup over Job run with RBO Settings 2.5 2.0 1.5 1.0 0.5 0.0 1 5 10 20 40 60 80 100 Percent of Tasks Profiled 30

Mul;- objec;ve Cluster Provisioning Running Time (min) 1,200 1,000 800 600 400 200 0 m1.small m1.large m1.xlarge c1.medium c1.xlarge Actual Predicted Cost ($) 10.00 8.00 6.00 4.00 2.00 0.00 m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type for Target Cluster Instance Type for Source Cluster: m1.large Actual Predicted 31

More info: www.cs.duke.edu/starfish Job-level MapReduce configuration Cluster sizing J 1 J 2 Data layout tuning J 3 J 4 Workflow optimization Workload management 32