Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Seif Haridi, KTH/SICS, haridi@kth.se, e2e-clouds.org
Presented by: Seif Haridi, May 2014

Research Areas Data-intensive computing Multi-Clouds Big Data 2 Ericsson Internal 2013-03-18 Page 2

Talk Outline
- Overview of Big Data
- Big data is here to stay, and its importance is increasing
- The Stratosphere data-analytics platform (Apache Flink)

What is Big Data?
[Figure: small data vs. big data]

What is Big Data?
Big Data refers to datasets and data flows that have grown so large that they outpace our capability to store, process, analyze, and understand them.

Why is Big Data Important in Science?
- In a wide array of academic fields, the ability to effectively process data is superseding other, more classical modes of research.
- More data trumps better algorithms *
- The more data your models have from which to learn, the more accurate they become, even if they weren't cutting-edge to begin with.
- In speech recognition research, increasing the model size by two orders of magnitude reduces the [word error rate] by 10% relative.
* The Unreasonable Effectiveness of Data [Halevy et al. 09]

Big Data means Parallelization Read genome on 100 machines: ~10 seconds
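
A back-of-the-envelope sketch of the arithmetic behind this slide. The file size (100 GB) and the per-machine read rate (100 MB/s) are assumptions chosen for illustration; the slide states neither, although these values happen to reproduce its ~10 second figure.

```python
def scan_seconds(size_bytes: float, machines: int, bytes_per_sec: float) -> float:
    """Time to read a file split evenly across `machines`, each reading at `bytes_per_sec`."""
    return size_bytes / (machines * bytes_per_sec)

GENOME_BYTES = 100e9   # assumed: 100 GB of genome data
DISK_RATE = 100e6      # assumed: 100 MB/s sequential read per machine

single = scan_seconds(GENOME_BYTES, 1, DISK_RATE)      # one machine
parallel = scan_seconds(GENOME_BYTES, 100, DISK_RATE)  # 100 machines
print(f"1 machine: {single:.0f} s, 100 machines: {parallel:.0f} s")
```

The model ignores coordination overhead and assumes each machine reads its share from a local disk, which is why data locality matters for the speedup.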

Big Data Processing with No Data Locality
[Figure: Job(/genomes/jim.bam) is submitted to a Workflow Manager and run on Compute Grid nodes; data blocks 1-6 are stored with three copies each.]
This doesn't scale. Bandwidth is the bottleneck.

MapReduce Data Locality
[Figure: Job(/genomes/jim.bam) is submitted to the Job Tracker, which hands tasks to the Task Trackers on six data nodes (DN); data blocks 1-6 are stored with three copies each; R = result file(s).]
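
A minimal sketch of locality-aware task placement, the idea this slide illustrates. The replica layout assumes the 18 block labels in the figure are grouped three per data node, in order; the node names `dn1` to `dn6` and the greedy policy are hypothetical and are not Hadoop's actual scheduler.

```python
# Replica layout as read off the figure (assumed grouping: three blocks per node).
REPLICAS = {
    "dn1": {1, 2, 3}, "dn2": {2, 5, 6}, "dn3": {4, 3, 6},
    "dn4": {3, 5, 6}, "dn5": {1, 2, 4}, "dn6": {1, 4, 5},
}

def assign_local(blocks, replicas):
    """Greedily give each block's task to the least-loaded node that holds a replica."""
    load = {node: 0 for node in replicas}
    assignment = {}
    for block in blocks:
        holders = [n for n in sorted(replicas) if block in replicas[n]]
        node = min(holders, key=lambda n: load[n])
        assignment[block] = node
        load[node] += 1
    return assignment

assignment = assign_local(range(1, 7), REPLICAS)
for block, node in assignment.items():
    print(f"block {block} -> {node}")
```

Every task reads its block from the local disk, so no input data crosses the network.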

Hadoop 2.x
Hadoop 1.x: single processing framework (batch apps)
- MapReduce (resource mgmt, job scheduler, data processing)
- HDFS (distributed storage)
Hadoop 2.x: multiple processing frameworks (batch, interactive, streaming)
- MapReduce (data processing) and others (Spark, MPI, Giraph, etc.)
- YARN/Mesos (resource mgmt, job scheduler)
- HDFS (distributed storage)

Open Source Communities

New Data Processing Frameworks
    val input = TextFile(textInput)
    val words = input.flatMap { line => line.split(" ") }
    val counts = words.groupBy { word => word }.count()
    val output = counts.write(wordsOutput, RecordDataSinkFormat())
    val plan = new ScalaPlan(Seq(output))

Stratosphere
[Figure: stack diagram. Top: SQL, Streaming, Graphs, ML, high-level languages. Engines: MapReduce, Stratosphere, Spark. Resource management: Mesos / YARN. Storage: HDFS.]

What is Stratosphere?
- An efficient, distributed, general-purpose data analysis platform
- Built on top of HDFS and YARN
- Focusing on ease of programming

Project status
- Research project started in 2009 by TU Berlin and HU Berlin, joined by SICS
- Now a growing open source project with first industrial installations
- Apache Incubator
- v0.4: stable and documented; v0.5: beta status

Introducing Stratosphere
General-purpose data analytics platform.
- From database technology: declarativity for SQL, optimizer, efficient runtime
- From MapReduce-style technology: scalability, user-defined functions (UDFs), complex data types, schema on read
- Stratosphere: iterations, advanced dataflows, declarativity

Stratosphere Stack
- APIs and front ends: Java API, Scala API, Spargel (graphs), Meteor (scripting), SQL, Python
- Shown alongside: Hadoop MapReduce (Hive, ...)
- Stratosphere Optimizer
- Stratosphere Runtime
- Cluster manager: Direct, YARN, EC2
- Storage: local files, HDFS, S3, JDBC, ...

Key Features
Easy-to-use developer APIs
- Java, Scala, Graphs, Nested Data (Python & SQL under development)
- Flexible composition of large programs
High-performance runtime
- Complex DAGs of operators
- In-memory & out-of-core
- Data streamed between operations
Automatic optimization
- Join algorithms
- Operator chaining
- Reusing partitioning/sorting
Native iterations
- Embedded in the APIs
- Data streaming / in-memory
- Delta iterations speed up many programs by orders of magnitude

Programming Model
A program is expressed as an arbitrary data flow consisting of transformations, sources, and sinks.
[Figure: example data flow from two Source operators through Map, Reduce, Iterate, and Join operators to a Sink.]

Transformations Higher-order functions that execute user-defined functions in parallel on the input data.
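
A minimal sketch of what "higher-order function" means here, using plain Python lists as stand-ins for partitions of a data set. `parallel_map` and the thread pool are illustrative assumptions, not part of the Stratosphere API.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(udf, partitions):
    """Apply the user-defined function `udf` to every record, one worker per partition."""
    def run(partition):
        return [udf(record) for record in partition]
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map keeps the results in partition order
        return list(pool.map(run, partitions))

partitions = [[1, 2, 3], [4, 5], [6]]
squared = parallel_map(lambda x: x * x, partitions)
print(squared)
```

The system owns the parallelism; the user supplies only the per-record function.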

Concise & rich APIs
Basic operators: Map, Reduce, Join, CoGroup, Union, Cross, Iterate, IterateDelta
Derived operators:
- Filter, FlatMap, Project
- Aggregate, Distinct
- Outer-Join, Inner-Join
- Vertex-centric graph computation (Pregel style)
- ...

Basic data operators: Map, Reduce, Cross, Match, CoGroup
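
One way to read the five operators is by how often each one calls the user-defined function. The sketch below spells that out on lists of (key, value) pairs; the function names and signatures are hypothetical and only mirror the operator names on the slide.

```python
from collections import defaultdict
from itertools import product

def group(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_op(pairs, udf):             # one call per record
    return [udf(k, v) for k, v in pairs]

def reduce_op(pairs, udf):          # one call per key group
    return [udf(k, vs) for k, vs in group(pairs).items()]

def cross_op(left, right, udf):     # one call per element of the Cartesian product
    return [udf(l, r) for l, r in product(left, right)]

def match_op(left, right, udf):     # one call per pair of records with equal keys
    rg = group(right)
    return [udf(k, lv, rv) for k, lv in left for rv in rg.get(k, [])]

def cogroup_op(left, right, udf):   # one call per key, with the groups of both inputs
    lg, rg = group(left), group(right)
    return [udf(k, lg.get(k, []), rg.get(k, [])) for k in sorted(set(lg) | set(rg))]

left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", 10), ("c", 30)]
sums = reduce_op(left, lambda k, vs: (k, sum(vs)))
matched = match_op(left, right, lambda k, l, r: (k, l + r))
sizes = cogroup_op(left, right, lambda k, ls, rs: (k, len(ls), len(rs)))
pairs = cross_op(left, right, lambda l, r: (l[0], r[0]))
```

Match behaves like an equi-join, while CoGroup also sees keys that occur on only one side.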

Transformations: Map
All pairs are independently processed.
    val input: DataSet[(Int, String)] = ...
    val mapped = input.flatMap { case (value, words) => words.split(" ") }


Concise & rich APIs
Word Count in Stratosphere Scala API
    // Data source
    val input = TextFile(textInput)
    // Transformations
    val words = input.flatMap { line => line.split(" ").map { word => (word, 1) } }
    val counts = words.groupBy { case (word, _) => word }
      .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
    // Data sink
    val output = counts.write(wordsOutput, CsvOutputFormat())

Job graphs to execution graphs

Joins in Stratosphere
    val large = env.readcsv(...)
    val medium = env.readcsv(...)
    val small = env.readcsv(...)
    val joined1 = large.join(medium).where(_._3).isEqualTo(_._1).map { (left, right) => ... }
    val joined2 = small.join(joined1).where(0).equals(2).map { (left, right) => ... }
    val result = joined2.groupBy { _._3 }.reduceGroup { el => el.maxBy { _._2 } }
[Figure: join tree over large, medium, and small, topped by a grouping/aggregation (γ).]
Built-in strategies include partitioned join and replicated join with local sort-merge or hybrid-hash algorithms.

Automatic Optimization
    DataSet<Tuple...> large = env.readcsv(...);
    DataSet<Tuple...> medium = env.readcsv(...);
    DataSet<Tuple...> small = env.readcsv(...);
    DataSet<Tuple...> joined1 = large.join(medium).where(3).equals(1).with(new JoinFunction() {... });
    DataSet<Tuple...> joined2 = small.join(joined1).where(0).equals(2).with(new JoinFunction() {... });
    DataSet<Tuple...> result = joined2.groupBy(3).aggregate(max, 2);
Possible execution:
1) Partitioned hash-join (partitioned, reduce-side)
2) Broadcast hash-join (broadcast, map-side)
3) Grouping/aggregation reuses the partitioning from step (1): no shuffle!
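
The two join strategies named on this slide can be sketched on plain lists of (key, value) pairs. This is an illustration of the idea only: the function names are hypothetical, and the real optimizer chooses between strategies using cost estimates that are not modeled here.

```python
def hash_join(build, probe):
    """Local hash join on the key: build a table from one side, probe it with the other."""
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]

def partition(records, n):
    """Hash-partition records so that equal keys land in the same partition."""
    parts = [[] for _ in range(n)]
    for key, value in records:
        parts[hash(key) % n].append((key, value))
    return parts

def partitioned_join(left, right, n):
    # Both inputs are shuffled by key; each partition then joins locally.
    lp, rp = partition(left, n), partition(right, n)
    return [row for i in range(n) for row in hash_join(lp[i], rp[i])]

def broadcast_join(small, large_parts):
    # The small input is copied to every partition; the large input is not shuffled.
    return [row for part in large_parts for row in hash_join(small, part)]

small = [(1, "s1"), (2, "s2")]
large = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
by_partition = sorted(partitioned_join(small, large, 2))
by_broadcast = sorted(broadcast_join(small, [large[:2], large[2:]]))
```

Both strategies return the same rows. The partitioned variant leaves its output partitioned by the join key, which is what lets a later grouping on that key skip the shuffle.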

Distributed Runtime
- Master (Job Manager) handles job submission, scheduling, and metadata
- Workers (Task Managers) execute operations
- Data can be streamed between nodes
- All operators start in-memory and gradually go out-of-core

Fault Tolerance
Similar to Spark: tracks execution history to rebuild on failure by recomputation.
    file.map(rec => (rec.type, 1))
        .reduce(_ + _)
        .filter((type, count) => count > 10)
[Figure: input file -> map -> reduce -> filter]
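
A minimal sketch of recovery by recomputation for the map, reduce, filter chain on this slide. The `Dataset` class, the sample records, and the lowered threshold (`> 2` instead of `> 10`, to suit the tiny input) are assumptions for illustration and do not describe the actual runtime.

```python
class Dataset:
    """A dataset that remembers how it was derived: a parent and a function."""
    def __init__(self, parent=None, fn=None, source=None):
        self.parent, self.fn, self.source = parent, fn, source
        self.cache = None  # materialized result; may be lost on failure

    def derive(self, fn):
        return Dataset(parent=self, fn=fn)

    def collect(self):
        if self.cache is None:  # missing result: rebuild it from the recorded history
            self.cache = self.source() if self.parent is None else self.fn(self.parent.collect())
        return self.cache

def sum_by_key(pairs):
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())

records = ["error", "info", "error", "warn", "error"]
file = Dataset(source=lambda: list(records))
mapped = file.derive(lambda recs: [(r, 1) for r in recs])
reduced = mapped.derive(sum_by_key)
filtered = reduced.derive(lambda counts: [(k, c) for k, c in counts if c > 2])

first = filtered.collect()
reduced.cache = filtered.cache = None   # simulate losing intermediate results
second = filtered.collect()             # recomputed from the surviving ancestor
```

Only the lost results are recomputed; `mapped` is still cached and is reused as the starting point.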

Runtime Architecture Comparison
    public class WC {
        public String word;
        public int count;
    }
Pool of Memory Pages
- Works on pages of bytes
- Maps objects transparently to these pages
- Full control over memory, out-of-core enabled
- Algorithms work on binary representation
- Address individual fields (not deserialize whole object)
Distributed Collection (List[WC])
- Collections of objects
- General-purpose serializer (Java / Kryo)
- Limited control over memory & less efficient spilling
- Deserialize all or nothing
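
A minimal sketch of the "address individual fields" point, storing `WC`-like records in one page of bytes. The fixed-width layout (16-byte word, 4-byte little-endian count) is an assumption made for illustration and is not the runtime's real binary format.

```python
import struct

# Assumed record layout: 16-byte word field followed by a 4-byte int count.
RECORD = struct.Struct("<16si")
COUNT_OFFSET = 16

def write_records(records):
    """Serialize (word, count) records back to back into one page of bytes."""
    page = bytearray(RECORD.size * len(records))
    for i, (word, count) in enumerate(records):
        RECORD.pack_into(page, i * RECORD.size, word.encode("utf-8"), count)
    return page

def read_count(page, index):
    """Read only the count field of record `index`, without decoding the word."""
    (count,) = struct.unpack_from("<i", page, index * RECORD.size + COUNT_OFFSET)
    return count

def sum_counts(page):
    return sum(read_count(page, i) for i in range(len(page) // RECORD.size))

page = write_records([("big", 3), ("data", 4)])
total = sum_counts(page)
```

Because the page is plain bytes, it can be written to disk or sent over the network without a separate serialization step.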

Iterative Programs
[Figure: stack diagram. Top: SQL, Streaming, Graphs, ML, high-level languages. Engines: MapReduce, Stratosphere, Spark. Resource management: Mesos / YARN. Storage: HDFS.]

Why Iterative Algorithms
Algorithms that need iterations:
- Clustering (K-Means, Canopy, ...)
- Gradient descent (e.g., Logistic Regression, Matrix Factorization)
- Graph algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, ...)
- Graph communities / dense sub-components
- Inference (belief propagation)
The loop makes multiple passes over the data.

Iterations in other systems
Loop outside the system.
[Figure: two examples of a client driving the loop, submitting one step after another.]

Iterations in Stratosphere
Streaming dataflow with feedback.
[Figure: dataflow of map, join, and reduce operators with a feedback edge.]
The system is iteration-aware and performs automatic optimization.

Iteration
Two types of iteration in Stratosphere:
- Bulk iteration
- Delta iteration
Both operators repeatedly invoke the step function on the current iteration state until a certain termination condition is reached.
2014-09-09 S. Haridi, E2E Clouds

Iteration: Bulk Iteration
- In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set) and computes the next version of the partial solution.
- A new version of the entire model in each iteration
    val input: DataSet[Int] = ...
    def step(partial: DataSet[Int]) = {
      val nextPartial = partial.map { a => a + 1 }
      nextPartial
    }
    val numIter = 10
    val iter = input.iterate(numIter, step)
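
The same bulk iteration can be sketched on a plain Python list to make its semantics concrete. The `iterate` helper is a stand-in that only mirrors the shape of the call on the slide; it is not the Stratosphere operator.

```python
def iterate(initial, num_iterations, step):
    """Feed the whole partial solution through `step` a fixed number of times."""
    partial = initial
    for _ in range(num_iterations):
        partial = step(partial)   # the entire data set is recomputed each round
    return partial

def step(partial):
    return [a + 1 for a in partial]

result = iterate([0, 5, 10], 10, step)
print(result)
```

Every element is touched in every round, even when only a few would change, which is the cost that delta iterations avoid.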

Iteration: Delta Iteration
Only parts of the model change in each iteration.
    val input: DataSet[(Int, Int)] = ...
    val initWset: DataSet[(Int, Int)] = ...
    val initSset: DataSet[(Int, Int)] = ...
    def step(ss: DataSet[Int], ws: DataSet[Int], ...) = {
      val delta = ...
      val nextWorkset = ...
    }
    val numIter = 10
    val iter = input.iterateWithWset( )

Iteration: Delta Iteration
Connected Components
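
A minimal sketch of connected components by label propagation, written once as a bulk iteration and once as a delta iteration with a solution set and a workset. The function names and the sample graph are assumptions for illustration; the code runs on plain dictionaries, not on the Stratosphere operators.

```python
def neighbors_of(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def cc_bulk(edges):
    """Recompute every vertex's label each round until nothing changes."""
    adj = neighbors_of(edges)
    label = {v: v for v in adj}
    work = []
    while True:
        new = {v: min([label[v]] + [label[n] for n in adj[v]]) for v in adj}
        work.append(len(adj))            # every vertex is touched each round
        if new == label:
            return label, work
        label = new

def cc_delta(edges):
    """Only vertices whose label changed (the workset) notify their neighbors."""
    adj = neighbors_of(edges)
    solution = {v: v for v in adj}       # solution set: current label per vertex
    workset = set(adj)                   # initially every vertex
    work = []
    while workset:
        work.append(len(workset))
        changed = set()
        for v in workset:
            for n in adj[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]
                    changed.add(n)
        workset = changed                # next round handles only what changed
    return solution, work

edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
bulk_labels, bulk_work = cc_bulk(edges)
delta_labels, delta_work = cc_delta(edges)
```

Both variants assign each vertex the smallest vertex ID in its component, but the delta variant's per-round work shrinks as labels converge.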


Automatic Optimization for Iterative Programs
- Pushing work out of the loop
- Caching loop-invariant data
- Maintaining state as an index
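
A small sketch of the first two points: work that does not depend on the loop state is done once and cached instead of being redone in every round. The adjacency index, the propagation step, and the build counter are hypothetical and stand in for whatever invariant work a real program contains; in Stratosphere this rewrite is performed by the optimizer, not by hand.

```python
def build_adjacency(edges, counter):
    counter["builds"] += 1               # counts how often the invariant work runs
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    return adj

def propagate(values, adj):
    """One step: each vertex takes the max of its own and its in-neighbors' values."""
    new = dict(values)
    for src, dsts in adj.items():
        for dst in dsts:
            new[dst] = max(new[dst], values[src])
    return new

def run_naive(edges, values, rounds, counter):
    for _ in range(rounds):
        values = propagate(values, build_adjacency(edges, counter))  # rebuilt each round
    return values

def run_hoisted(edges, values, rounds, counter):
    adj = build_adjacency(edges, counter)   # built once, cached across rounds
    for _ in range(rounds):
        values = propagate(values, adj)
    return values

edges = [(1, 2), (2, 3)]
start = {1: 5, 2: 1, 3: 0}
naive_count, hoisted_count = {"builds": 0}, {"builds": 0}
naive = run_naive(edges, start, 3, naive_count)
hoisted = run_hoisted(edges, start, 3, hoisted_count)
```

The results are identical; only the number of index builds differs.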

Delta iterations speed up certain problems by a lot
Covers typical use cases of Pregel-like systems with comparable performance in a generic platform and developer API.
[Figure: computations performed in each iteration for connected communities of a social graph; x-axis: iteration; y-axis: # vertices (thousands); series: Bulk, Delta.]
[Figure: runtime (secs) on the Twitter and Webbase (20) data sets.]

Thank you!
Multi-Clouds, Big Data