Research challenges in data-intensive computing The Stratosphere Project Apache Flink


Research challenges in data-intensive computing: The Stratosphere Project / Apache Flink
Seif Haridi, KTH/SICS (haridi@kth.se, e2e-clouds.org)
Presented by Seif Haridi, May 2014

Research Areas: data-intensive computing, multi-clouds, Big Data
Ericsson Internal 2013-03-18

Talk Outline
- Overview of Big Data: big data is here to stay, and its importance is increasing
- The Stratosphere data-analytics platform / Apache Flink

What is Big Data? Small data vs. big data.

What is Big Data? Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand them.

Why is Big Data Important in Science? In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research. "More data trumps better algorithms"*: the more data your models have from which to learn, the more accurate they become, even if they weren't cutting-edge to begin with. In speech recognition research, "increasing the model size by two orders of magnitude reduces the [word error rate] by 10% relative." * The Unreasonable Effectiveness of Data [Halevy et al. '09]

Big Data means parallelization: reading a genome on 100 machines takes ~10 seconds.

Big Data Processing with No Data Locality: Job("/genomes/jim.bam") is submitted to a workflow manager, which assigns it to a node in a compute grid and ships the data blocks to that node over the network. This doesn't scale: bandwidth is the bottleneck.

MapReduce Data Locality: Job("/genomes/jim.bam") is submitted to the Job Tracker, which splits it into tasks and schedules each task on the Task Tracker co-located with the DataNode (DN) holding its input block; result files (R) are written locally.

Hadoop 1.x vs. Hadoop 2.x
Hadoop 1.x, a single processing framework (batch apps): MapReduce (resource mgmt, job scheduling, data processing) on top of HDFS (distributed storage).
Hadoop 2.x, multiple processing frameworks (batch, interactive, streaming): MapReduce (data processing) and others (Spark, MPI, Giraph, etc.) on top of YARN/Mesos (resource mgmt, job scheduling) and HDFS (distributed storage).

Open Source Communities

New Data Processing Frameworks

val input  = TextFile(textInput)
val words  = input.flatMap { line => line.split(" ") }
val counts = words.groupBy { word => word }.count()
val output = counts.write(wordsOutput, RecordDataSinkFormat())
val plan   = new ScalaPlan(Seq(output))

The Stratosphere stack: SQL, streaming, graphs, ML, and high-level languages on top; MapReduce, Stratosphere, and Spark as processing engines; Mesos / YARN for resource management; HDFS for storage.

What is Stratosphere?
- An efficient, distributed, general-purpose data analysis platform
- Built on top of HDFS and YARN
- Focused on ease of programming

Project status
- Research project started in 2009 by TU Berlin and HU Berlin, later joined by SICS
- Now a growing open source project with first industrial installations
- In the Apache Incubator: v0.4 is stable and documented, v0.5 is in beta

Introducing Stratosphere: a general-purpose data analytics platform that combines database technology (declarativity as in SQL, an optimizer, an efficient runtime) with MapReduce-style technology (scalability, user-defined functions (UDFs), complex data types, schema on read), and adds iterations and advanced dataflows.

Stratosphere Stack
APIs and languages: Java API, Scala API, Spargel (graphs), Meteor (scripting), SQL, Python, Hadoop MapReduce, Hive, ...
Core: Stratosphere Optimizer and Stratosphere Runtime
Cluster manager: direct, YARN, EC2
Storage: local files, HDFS, S3, JDBC, ...

Key Features
- Easy-to-use developer APIs: Java, Scala, graphs, nested data (Python & SQL under development); flexible composition of large programs
- High-performance runtime: complex DAGs of operators, in-memory & out-of-core, data streamed between operations
- Automatic optimization: join algorithms, operator chaining, reuse of partitioning/sorting
- Native iterations: embedded in the APIs, data streaming / in-memory; delta iterations speed up many programs by orders of magnitude

Programming Model: a program is expressed as an arbitrary dataflow consisting of transformations, sources, and sinks, e.g. two sources feeding Map operators, then Reduce, Iterate, Join, and Reduce stages into a sink.

Transformations: higher-order functions that execute user-defined functions in parallel on the input data.
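The idea can be sketched on local Scala collections (a standalone toy example, not the Stratosphere API): each transformation is a higher-order function that takes the user-defined function as an argument.

```scala
// Sketch: transformations as higher-order functions over a local
// collection, mimicking the parallel operators (not the real API).
object TransformationsSketch {
  def main(args: Array[String]): Unit = {
    val input = Seq(1, 2, 3, 4)

    // "Map": the user-defined function is applied to each element.
    val doubled = input.map(_ * 2)

    // "Reduce": the user-defined function combines elements pairwise.
    val sum = doubled.reduce(_ + _)

    println(doubled.mkString(",")) // 2,4,6,8
    println(sum)                   // 20
  }
}
```

In the real system the same UDFs run in parallel on partitions of a distributed dataset rather than on a single in-memory sequence.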

Concise & rich APIs
Basic operators: Map, Reduce, Join, CoGroup, Union, Cross, Iterate, IterateDelta
Derived operators: Filter, FlatMap, Project; Aggregate, Distinct; outer join, inner join; vertex-centric graph computation (Pregel style); ...

Basic data operators: Map, Reduce, Cross, Match, CoGroup

Transformations: Map. All pairs are independently processed.

val input: DataSet[(Int, String)] = ...
val mapped = input.flatMap { case (value, words) => words.split(" ") }


Concise & rich APIs: Word Count in the Stratosphere Scala API

val input  = TextFile(textInput)                           // data source
val words  = input.flatMap { line =>
  line.split(" ").map { word => (word, 1) }
}                                                          // transformations
val counts = words.groupBy { case (word, _) => word }
                  .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
val output = counts.write(wordsOutput, CsvOutputFormat())  // data sink
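For reference, the same word-count logic can be run locally on plain Scala collections (a self-contained sketch, not the Stratosphere API), which makes the flatMap / groupBy / reduce pipeline easy to inspect:

```scala
// Word count on local Scala collections, mirroring the
// flatMap -> groupBy -> reduce pipeline from the slide.
object WordCountLocal {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").map(word => (word, 1)))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be"))
    println(counts("to")) // 2
    println(counts("be")) // 2
  }
}
```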

Job graphs to execution graphs

Joins in Stratosphere

val large  = env.readCsv(...)
val medium = env.readCsv(...)
val small  = env.readCsv(...)

val joined1 = large.join(medium).where(_._3).isEqualTo(_._1)
  .map { (left, right) => ... }
val joined2 = small.join(joined1).where(0).equalTo(2)
  .map { (left, right) => ... }
val result  = joined2.groupBy { _._3 }
  .reduceGroup { els => els.maxBy(_._2) }

Built-in strategies include partitioned join and replicated join, with local sort-merge or hybrid-hash algorithms.

Automatic Optimization

DataSet<Tuple...> large  = env.readCsv(...);
DataSet<Tuple...> medium = env.readCsv(...);
DataSet<Tuple...> small  = env.readCsv(...);
DataSet<Tuple...> joined1 = large.join(medium).where(3).equalTo(1)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> joined2 = small.join(joined1).where(0).equalTo(2)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> result  = joined2.groupBy(3).aggregate(MAX, 2);

Possible execution: (1) a partitioned (reduce-side) hash join, (2) a broadcast (map-side) hash join, and (3) a grouping/aggregation that reuses the partitioning from step (1): no shuffle!
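To make the two join strategies concrete, here is a minimal local sketch on plain Scala collections (hypothetical names, not the Stratosphere API): a hash join builds a hash table on one input and probes it with the other. In a broadcast join the small build side is shipped whole to every node; in a partitioned join both inputs are first hash-partitioned by key.

```scala
// Minimal hash-join sketch on local collections: build a hash table
// on the (small) build side, then probe it with the other side.
// Broadcast join: the build table is replicated to every node.
// Partitioned join: both sides are hash-partitioned by key first.
object HashJoinSketch {
  def hashJoin[K, A, B](build: Seq[(K, A)], probe: Seq[(K, B)]): Seq[(K, A, B)] = {
    // Build phase: key -> all build-side values with that key.
    val table: Map[K, Seq[A]] =
      build.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    // Probe phase: emit one output tuple per matching pair.
    for {
      (k, b) <- probe
      a      <- table.getOrElse(k, Seq.empty)
    } yield (k, a, b)
  }

  def main(args: Array[String]): Unit = {
    val small = Seq((1, "a"), (2, "b"))
    val large = Seq((1, 10), (1, 11), (3, 30))
    println(hashJoin(small, large)) // List((1,a,10), (1,a,11))
  }
}
```

Which variant the optimizer picks depends on input sizes and existing partitioning, which is exactly what the slide's three-step execution plan illustrates.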

Distributed Runtime
- The master (Job Manager) handles job submission, scheduling, and metadata
- Workers (Task Managers) execute operations
- Data can be streamed between nodes
- All operators start in-memory and gradually go out-of-core

Fault Tolerance
Similar to Spark: the system tracks the execution history and rebuilds lost results on failure by recomputation.

file.map(rec => (rec.type, 1))
    .reduce(_ + _)
    .filter((type, count) => count > 10)

Input file -> map -> reduce -> filter

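The lineage-based recovery idea can be sketched locally (a hypothetical toy model, not the actual runtime): each dataset remembers its parent and the transformation that produced it, so a lost result can be rebuilt by replaying the chain from the input.

```scala
// Toy lineage model: each dataset node records the transformation that
// produced it, so a lost result can be recomputed from its parent
// (Spark-style recovery by recomputation, as described on the slide).
object LineageSketch {
  sealed trait Dataset { def compute(): Seq[Int] }
  case class Input(data: Seq[Int]) extends Dataset {
    def compute(): Seq[Int] = data
  }
  case class Transformed(parent: Dataset, f: Seq[Int] => Seq[Int]) extends Dataset {
    def compute(): Seq[Int] = f(parent.compute()) // replay the lineage chain
  }

  def main(args: Array[String]): Unit = {
    val input    = Input(Seq(1, 2, 3, 4, 5))
    val mapped   = Transformed(input, _.map(_ * 10))
    val filtered = Transformed(mapped, _.filter(_ > 20))
    // Pretend the cached result was lost: recompute it from lineage.
    println(filtered.compute()) // List(30, 40, 50)
  }
}
```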

Runtime Architecture Comparison

public class WC {
  public String word;
  public int count;
}

Stratosphere runtime: works on a pool of memory pages (pages of bytes), maps objects transparently to these pages, has full control over memory (out-of-core enabled); algorithms work on the binary representation and can address individual fields without deserializing the whole object.
Spark-style runtime: a distributed collection of objects (List[WC]) with a general-purpose serializer (Java / Kryo), limited control over memory, less efficient spilling, and all-or-nothing deserialization.

Iterative Programs

Why Iterative Algorithms?
Algorithms that need iterations:
- Clustering (K-Means, Canopy, ...)
- Gradient descent (e.g., logistic regression, matrix factorization)
- Graph algorithms (e.g., PageRank, LineRank, components, paths, reachability, centrality, ...)
- Graph communities / dense sub-components
- Inference (belief propagation)
A loop makes multiple passes over the data.

Iterations in other systems: the client drives the loop outside the system, submitting a separate job for each step (step, step, step, ...).

Iterations in Stratosphere: a streaming dataflow with feedback (e.g., reduce, map, and join operators inside the loop). The system is iteration-aware and performs automatic optimization.

Iteration
Two types of iteration in Stratosphere: bulk iteration and delta iteration. Both operators repeatedly invoke the step function on the current iteration state until a termination condition is reached.
2014-09-09, S. Haridi, E2E Clouds

Iteration: Bulk Iteration
In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set) and computes the next version of the partial solution: a new version of the entire model in each iteration.

val input: DataSet[Int] = ...
def step(partial: DataSet[Int]) = {
  val nextPartial = partial.map { a => a + 1 }
  nextPartial
}
val numIter = 10
val iter = input.iterate(numIter, step)
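The bulk-iteration semantics can be reproduced on a local Scala collection (a standalone sketch, not the Stratosphere API): the whole partial solution is recomputed in every round.

```scala
// Bulk iteration on a local collection: the step function consumes the
// entire previous partial solution and produces the next one, numIter times.
object BulkIterationSketch {
  def iterate[T](initial: Seq[T], numIter: Int)(step: Seq[T] => Seq[T]): Seq[T] =
    (1 to numIter).foldLeft(initial)((partial, _) => step(partial))

  def main(args: Array[String]): Unit = {
    val input  = Seq(0, 1, 2)
    // Same step as the slide: add 1 to every element, 10 times.
    val result = iterate(input, 10)(partial => partial.map(_ + 1))
    println(result) // List(10, 11, 12)
  }
}
```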

Iteration: Delta Iteration
Only parts of the model change in each iteration.

val input: DataSet[(Int, Int)] = ...
val initWset: DataSet[(Int, Int)] = ...
val initSset: DataSet[(Int, Int)] = ...
def step(ss: DataSet[Int], ws: DataSet[Int], ...) = {
  val delta = ...
  val nextWorkset = ...
}
val numIter = 10
val iter = input.iterateWithWorkset(...)

Iteration: Delta Iteration example: Connected Components
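A local sketch of connected components with a workset (a hypothetical plain-Scala model of the delta iteration, not the Stratosphere API): each vertex keeps the minimum component id seen so far, and only vertices whose label changed stay in the workset, so the work shrinks as labels converge.

```scala
// Connected components with a delta-style workset: each vertex keeps the
// minimum component id seen so far; only vertices whose label changed are
// re-examined in the next round, so the workset shrinks toward empty.
object ConnectedComponentsSketch {
  def components(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Undirected adjacency: add both directions of every edge.
    val neighbors: Map[Int, Seq[Int]] =
      (edges ++ edges.map { case (a, b) => (b, a) })
        .groupBy(_._1).map { case (v, es) => (v, es.map(_._2)) }

    var labels  = vertices.map(v => (v, v)).toMap // solution set: vertex -> component id
    var workset = vertices                        // vertices to re-examine
    while (workset.nonEmpty) {
      // Propagate smaller labels from workset vertices to their neighbors.
      val changed = for {
        v <- workset
        n <- neighbors.getOrElse(v, Seq.empty)
        if labels(v) < labels(n)
      } yield (n, labels(v))
      // Delta: per vertex, the best (minimum) improved label this round.
      val delta = changed.groupBy(_._1)
        .map { case (n, ls) => (n, ls.map(_._2).min) }
        .filter { case (n, l) => l < labels(n) }
      labels  = labels ++ delta   // update only the changed part of the model
      workset = delta.keySet      // next round touches only changed vertices
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    val labels = components(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5)))
    println(labels.toSeq.sorted)
  }
}
```

Vertices 1, 2, 3 converge to component 1, and vertices 4, 5 to component 4; the workset empties once no label improves, which is the delta iteration's termination condition.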


Automatic Optimization for Iterative Programs
- Pushing work out of the loop
- Caching loop-invariant data
- Maintaining state as an index

Delta iterations speed up certain problems by a lot: they cover the typical use cases of Pregel-like systems with comparable performance, in a generic platform and developer API.
[Charts: number of vertices (thousands) processed per iteration (0-34), bulk vs. delta, for connected communities of a social graph; runtimes (secs) on the Twitter and Webbase (20) datasets.]

Thank you!