Faster Workflows, Faster
Ken Krugler, President, Scale Unlimited
The Twitter Pitch
- Cascading is a solid, established workflow API
- Good for complex custom ETL workflows
- Flink is a new streaming dataflow engine
- 50% better performance by leveraging memory
Perils of Comparisons
- Performance comparisons are clickbait for devs ("You won't believe the speed of Flink!")
- Really, really hard to do well
- I did moderate tuning of the existing code base
- Somewhat complex workflow in EMR
- Dataset bigger than memory (100M to 1B records)
TL;DR
- Flink gets faster by minimizing disk I/O
- A Map-Reduce job always has a write/read at each job break
- It can also spill map output, and the reduce does a merge-sort
- Flink has no job boundaries
- And no map-side spills
- So only the reduce merge-sort is extra I/O
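The bullet points above can be turned into a back-of-envelope calculation. This is an illustrative sketch only: the pass counts per stage are assumptions that model the slide's claim, not measurements.

```java
// Rough count of extra disk passes for a chain of N map-reduce stages,
// modeling the TL;DR above. Counts are illustrative, not measurements.
public class DiskPasses {
    // MapReduce: each job boundary costs one HDFS write plus one read,
    // map output may spill, and every reduce pays a merge-sort pass.
    static int mapReducePasses(int stages, boolean mapSideSpill) {
        int jobBoundaryIO = 2 * (stages - 1);      // write + read per job break
        int mapSpills = mapSideSpill ? stages : 0; // optional map output spill
        int reduceSorts = stages;                  // merge-sort per reduce
        return jobBoundaryIO + mapSpills + reduceSorts;
    }

    // Flink: no job boundaries and no map-side spills, so only the
    // reduce merge-sort touches disk.
    static int flinkPasses(int stages) {
        return stages;
    }

    public static void main(String[] args) {
        System.out.println("MR passes:    " + mapReducePasses(4, true)); // 2*3 + 4 + 4 = 14
        System.out.println("Flink passes: " + flinkPasses(4));           // 4
    }
}
```

Under these assumptions, a four-stage workflow does 14 disk passes under MapReduce versus 4 under Flink, which is why the win grows with the number of job boundaries.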
TOC
- A very short intro to Cascading
- An even shorter intro to Flink
- An example of converting a workflow
- More in-depth results
In the beginning
- There was Hadoop, and it was good
- But life is too short to write M-R jobs with K-V data
- Then along came Cascading
Cascading
What is Cascading?
- A thin Java library on top of Hadoop
- An open source project (APL, 8 years old)
- An API for defining and running ETL workflows
30,000ft View
Records (Tuples) flow through Pipes:
[clay, boxing, 5]
30,000ft View
Pipes connect Operations:
[term, link, distance]
[clay, boxing, 5]
  -> Filter by Distance
30,000ft View
You do Operations on Tuples:
[term, link, distance]
[clay, boxing, 5]
[clay, olympics, 15]
  -> Filter by Distance ->
[term, link]
[clay, boxing]
30,000ft View
Tuples flow into Pipes from Source Taps:
Hfs Tap ->
[term, link, distance]
[clay, boxing, 5]
[clay, olympics, 15]
30,000ft View
Tuples flow from Pipes into Sink Taps:
[term, link, weight]
[clay, boxing, 2.57]
[clay, pottery, 5.32]
  -> Hfs Tap
30,000ft View
This is a data processing workflow (Flow):
Hfs Tap ->
[term, link, distance]
[clay, boxing, 5]
[clay, olympics, 15]
  -> Filter by Distance ->
[term, link]
[clay, boxing]
  -> Group By Term/Link, Count ->
[term, link, count]
[clay, boxing, 237]
  -> Hfs Tap
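The flow above can be modeled in a few lines of plain Java, to make the tuple/pipe/operation idea concrete before looking at the real API. This is only a sketch of the concept, not Cascading code; the distance threshold of 10 is an assumption chosen so the example data filters the way the slides show.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java model of the flow above: tuples are ordered field lists,
// one operation filters by distance, the next projects to [term, link].
public class MiniFlow {
    record Tuple(String term, String link, int distance) {}

    static List<String> run(List<Tuple> source, int maxDistance) {
        return source.stream()
            .filter(t -> t.distance() <= maxDistance)          // Filter by Distance
            .map(t -> "[" + t.term() + ", " + t.link() + "]")  // drop the distance field
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Tuple> source = List.of(
            new Tuple("clay", "boxing", 5),
            new Tuple("clay", "olympics", 15));
        // With an assumed threshold of 10, only [clay, boxing] survives:
        System.out.println(run(source, 10));  // prints [[clay, boxing]]
    }
}
```

The real Cascading version of this pattern, with taps, pipes, and a FlowDef, is on the next slide.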
Java API to Define Flow

Pipe ipDataPipe = new Pipe("ip data pipe");
RegexParser ipDataParser = new RegexParser(new Fields("Data IP", "Country"), "^([\\d\\.]+)\t(.*)");
ipDataPipe = new Each(ipDataPipe, new Fields("line"), ipDataParser);

Pipe logAnalysisPipe = new CoGroup(
    logDataPipe,              // left-side pipe
    new Fields("Log IP"),     // left-side field for joining
    ipDataPipe,               // right-side pipe
    new Fields("Data IP"),    // right-side field for joining
    new LeftJoin());          // type of join to do

logAnalysisPipe = new GroupBy(logAnalysisPipe, new Fields("Country", "Status"));
logAnalysisPipe = new Every(logAnalysisPipe, new Count(new Fields("Count")));
logAnalysisPipe = new Each(logAnalysisPipe, new Fields("Country"), new Not(new RegexFilter("null")));

Tap logDataTap = new Hfs(new TextLine(), "access.log");
Tap ipDataTap = new Hfs(new TextLine(), "ip-map.tsv");
Tap outputTap = new Hfs(new TextLine(), "results");

FlowDef flowDef = new FlowDef()
    .setName("log analysis flow")
    .addSource(logDataPipe, logDataTap)
    .addSource(ipDataPipe, ipDataTap)
    .addTailSink(logAnalysisPipe, outputTap);

Flow flow = new HadoopFlowConnector(properties).connect(flowDef);
Visualizing the DAG

[Diagram: the planned flow DAG for the log analysis example. Two Hfs TextLine source taps (access.log and ip-map.tsv) feed Each/RegexParser operations producing ['logip', 'status'] and ['ip', 'country']. A CoGroup joins the two pipes on logip/ip, writing a TempHfs SequenceFile at the job boundary, followed by a GroupBy on ['country', 'status'], an Every/Count, an Each filter dropping null countries, and a final Hfs TextLine sink. Legend: file read or write, map operation, reduce operation.]
Things I Like
- Stream shaping - easy to add, drop fields
- Field consistency checking in DAG
- Building blocks - Operations, SubAssemblies
- Flexible planner - MR, local, Tez
Time for a Change
- I've used Cascading for 100s of projects
- And made a lot of money consulting on ETL
- But it was getting kind of boring
Flink
Elevator Pitch
- High throughput/low latency stream processing
- Also supports batch (bounded streams)
- Runs locally, stand-alone, or in YARN
- Super-awesome team
Versus Spark?
- Sigh. OK...
- Very similar in many ways
- Natively streaming, vs. natively batch
- Not as mature, smaller community/ecosystem
Similar to Cascading
- You define a DAG with Java (or Scala) code
- You have data sources and sinks
- Data flows through streams to operators
- The planner turns this into a bunch of tasks
It's Faster, But
- I don't want to rewrite my code
- I use lots of custom Cascading schemes
- I don't really know Scala
- And POJOs ad nauseam are no fun
- Same for Tuple21<Integer, String, String, ...>
Scala is the New APL

val input = env.readFileStream(fileName, 100)
  .flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .timeWindowAll(Time.of(60, TimeUnit.SECONDS))
  .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(5)))
  .fold(Set[String]()) { (r, i) => r + i }
  .map { x => (new Timestamp(System.currentTimeMillis()), x.size) }
Tuple21 Hell

public void reduce(Iterable<Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>>> tupleGroup,
                   Collector<Tuple3<String, String, Double>> out) {
    for (Tuple2<Tuple3<String, String, Integer>, Tuple2<String, Integer>> tuple : tupleGroup) {
        Tuple3<String, String, Integer> idWC = tuple.f0;
        Tuple2<String, Integer> idTW = tuple.f1;
        out.collect(new Tuple3<String, String, Double>(idWC.f0, idWC.f1, (double) idWC.f2 / idTW.f1));
    }
}
Cascading-Flink
Cascading 3 Planner
- Converts the Cascading DAG into a Flink DAG
- Around 5K lines of code
- The DAG it plans looks like Cascading local mode
Boundaries for Data Sets

[Diagram: the planned Flink DAG for the Wikiwords workflow. Two Hfs SequenceFile sources (terms and categories) feed a web of Each, GroupBy, Every, and CoGroup operations computing term counts per doc, doc counts per term, total doc count, TF-IDF scores, and term-to-category scores, ending in Hfs TextLine sinks (term_scores and term_categories). Annotations: Speed == no spill to disk; Task CPU is the same, other than serde time.]
How Painful?
- Use the FlinkFlowConnector
- The Flink Flow planner converts the DAG to a job
- Uber jar vs. classic Hadoop jar
- Grungy details of submitting jobs to the EMR cluster
Wikiwords Workflow
- Find associations between terms and categories
- For every page in Wikipedia, for every term:
- Find distance from term to intra-wiki links
- Then calc statistics to find strong associations
- Probability unusually high that term is close to link
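The DAG on the previous slide computes a TF*IDF score per (term, doc) pair as part of this statistic. A self-contained sketch of that calculation is below; the exact formula variant (raw term frequency, natural-log IDF) is an assumption for illustration, not necessarily the one the workflow uses.

```java
// Sketch of the per-(term, doc) TF*IDF score the Wikiwords DAG computes.
// The formula variant here (raw tf, natural-log idf) is an assumption.
public class TfIdf {
    static double tfIdf(int termCountInDoc, int totalTermsInDoc,
                        int totalDocs, int docsWithTerm) {
        double tf = (double) termCountInDoc / totalTermsInDoc;   // term frequency in doc
        double idf = Math.log((double) totalDocs / docsWithTerm); // inverse doc frequency
        return tf * idf;
    }

    public static void main(String[] args) {
        // A term appearing 3 times in a 100-term article, present in
        // 1,000 of 1,000,000 articles, scores well above zero:
        System.out.println(tfIdf(3, 100, 1_000_000, 1_000));
        // A term present in every article scores 0 (idf = ln 1 = 0):
        System.out.println(tfIdf(3, 100, 1_000_000, 1_000_000));
    }
}
```

Rare-but-locally-frequent terms score highest, which is what makes the score useful for finding terms strongly associated with a category.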
Timing Test Details
- EMR cluster with 5 i2.xlarge slaves
- ~1 billion input records (term, article ref, distance)
- Hadoop MapReduce took 148 minutes
- Flink took 98 minutes
- So 1.5x faster - nice but not great
- Mostly due to spillage in many boundaries
Summary
If You're a Java ETL Dev
- And you have to deal with batch big data
- Then the Cascading API is a good fit
- And using Flink typically gives better performance
- While still using a standard Hadoop/YARN cluster
Status of Cascading-Flink
- Still young, but surprisingly robust
- Doesn't support Full or RightOuter HashJoins
- Pending optimizations: tuple serialization, Flink improvements
Better Planning
- Defer the late-stage join to avoid premature resource usage

[Diagram: Data Source 1 flows through Operation 1 and Operation 2; Data Source 2 flows through a Group; the Join is deferred until the end of the flow.]
More questions?
- Feel free to contact me
  http://www.scaleunlimited.com/contact/
  ken@scaleunlimited.com
- Check out Cascading, Flink, and Cascading-Flink
  http://www.cascading.org
  http://flink.apache.org
  http://github.com/dataartisans/cascading-flink