TOWARDS PORTABILITY AND BEYOND: DATA PROCESSING WITH APACHE BEAM. Maximilian Michels, maximilianmichels.com


1 TOWARDS PORTABILITY AND BEYOND: DATA PROCESSING WITH APACHE BEAM
Maximilian Michels, maximilianmichels.com

2 BEAM VISION
Write, Pipeline, Execute
SDKs, Runners, Backends

3 THE BEAM MODEL
- Unified batch and stream programming model
- Stream: Batch is just a bounded stream
- Massively parallelizable Transformations
- Event Time as an explicit concept
- 2015 VLDB: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

4 BEAM API

5 THE BEAM API
Pipeline: An (acyclic) graph of PCollections
[Output PCollection] = [Input PCollection].apply([Transform])

    Pipeline p = Pipeline.create(options);
    PCollection pcollection = p.apply(...).apply(...)...;
    p.run();

6 INPUT

    Pipeline p = Pipeline.create(options);

    // Read lines from a text file.
    PCollection<String> input1 = p.apply(
        "ReadMyFile", TextIO.read().from("protocol://path/to/some/inputdata.txt"));

    // Create a PCollection from in-memory values.
    PCollection<String> input2 = p.apply(
        Create.of("DataEngConf", "is", "awesome"));

    // Read key/value records from Kafka.
    PCollection<KV<Long, String>> input3 = p.apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("broker_1:9092,broker_2:9092")
            .withTopic("my_topic")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            // Without this call the result is PCollection<KafkaRecord<Long, String>>.
            .withoutMetadata());

7 CORE PRIMITIVES TRANSFORMS

ParDo (Map/Reduce Phase): input -> output
    to  -> KV<"to", 1>
    be  -> KV<"be", 1>
    or  -> KV<"or", 1>
    not -> KV<"not", 1>
    to  -> KV<"to", 1>
    be  -> KV<"be", 1>

GroupByKey (Shuffle Phase): KV<k, v> -> KV<k, [v]>
    KV<"to", [1, 1]>
    KV<"be", [1, 1]>
    KV<"or", [1]>
    KV<"not", [1]>
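
The two primitives on this slide can be imitated in a few lines of plain Python. This is only a sketch of the semantics, assuming an in-memory list instead of a distributed PCollection; `par_do` and `group_by_key` are hypothetical helper names, not Beam API calls.

```python
from collections import defaultdict


def par_do(elements, fn):
    """Element-wise step: fn may emit zero or more outputs per input."""
    for element in elements:
        yield from fn(element)


def group_by_key(pairs):
    """Shuffle step: collect every value that shares a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


words = ["to", "be", "or", "not", "to", "be"]
pairs = list(par_do(words, lambda word: [(word, 1)]))
grouped = group_by_key(pairs)
print(grouped)
```

In a real runner the element-wise step runs in parallel on many workers and the grouping step moves data between them, which is why the slide labels them as separate phases.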

8 CORE PRIMITIVES TRANSFORMS

ParDo

    PCollection<String> words = ...;
    PCollection<KV<String, Integer>> wordCounts = words.apply(
        "AssignWordCounts",
        ParDo.of(new DoFn<String, KV<String, Integer>>() {
          @ProcessElement
          public void processElement(
              @Element String word, OutputReceiver<KV<String, Integer>> out) {
            // Emit each word with an initial count of 1.
            out.output(KV.of(word, 1));
          }
        }));

GroupByKey

    PCollection<KV<String, Iterable<Integer>>> groupByWordCounts =
        wordCounts.apply("GroupByWords", GroupByKey.create());

9 COMPOSITE TRANSFORMS

Combine: (Map/) Shuffle/Reduce Phase

    PCollection<KV<String, Integer>> wordCounts = ...;
    PCollection<KV<String, Integer>> combinedWordCounts = wordCounts.apply(
        Combine.perKey(new SerializableFunction<Iterable<Integer>, Integer>() {
          @Override
          public Integer apply(Iterable<Integer> input) {
            // Sum all counts that share a key.
            int sum = 0;
            for (int count : input) {
              sum += count;
            }
            return sum;
          }
        }));

For more sophisticated combining, you can define a CombineFn.
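
To illustrate why a CombineFn can be more flexible than a plain function over an Iterable, here is a minimal plain-Python sketch of the accumulator lifecycle. The method names mirror those of Beam's CombineFn, but `MeanFn` and `combine_bundle` are hypothetical stand-ins that do not depend on Beam, and the split into two bundles is an assumption made for the example.

```python
class MeanFn:
    """Computes a mean via partial accumulators of (running sum, count)."""

    def create_accumulator(self):
        return (0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine_bundle(fn, values):
    """Fold one bundle of values into a single partial accumulator."""
    accumulator = fn.create_accumulator()
    for value in values:
        accumulator = fn.add_input(accumulator, value)
    return accumulator


fn = MeanFn()
partial_a = combine_bundle(fn, [1, 2, 3])  # e.g. values seen by one worker
partial_b = combine_bundle(fn, [10])       # e.g. values seen by another worker
mean = fn.extract_output(fn.merge_accumulators([partial_a, partial_b]))
print(mean)
```

Because partial accumulators can be merged, part of the reduction can happen before the shuffle, which matches the "(Map/) Shuffle/Reduce" label on the slide.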

10 MORE TRANSFORMS
- CoGroupByKey (Join)
- Flatten
- Partition
- Define your own!
Also: Side Inputs / Multiple Outputs / State / Timers

11 PROCESSING UNBOUNDED DATA

12 PROCESSING (UN)BOUNDED DATA
- Streams are unbounded by nature
- Windows group data according to a windowing function
- Triggers decide when to kick off execution of windows
- Event Time is the predominant domain for windowing
- Every element has an event timestamp
- The Watermark indicates the current event time
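
The bullets above can be made concrete with a small plain-Python sketch. It assumes fixed two-minute windows, integer timestamps in seconds, and a watermark that is simply handed in as a number; `assign_fixed_window` and `is_complete` are hypothetical helpers, not Beam calls.

```python
from collections import defaultdict

WINDOW_SIZE = 120  # seconds, i.e. a fixed window of two minutes


def assign_fixed_window(event_time, size=WINDOW_SIZE):
    """Windowing function: map an event timestamp to [start, end)."""
    start = event_time - (event_time % size)
    return (start, start + size)


def is_complete(window, watermark):
    """A window can fire on time once the watermark passes its end."""
    return watermark >= window[1]


# (key, value, event timestamp), listed in arrival order.
events = [("alice", 5, 10), ("bob", 3, 130), ("alice", 7, 115), ("alice", 1, 125)]

sums = defaultdict(int)
for key, value, timestamp in events:
    sums[(key, assign_fixed_window(timestamp))] += value

print(dict(sums))
print(is_complete((0, 120), watermark=119), is_complete((0, 120), watermark=120))
```

Note that the third event arrives after an element of the next window but still lands in the first window, because grouping follows the event timestamp rather than the arrival order.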

13 EVENT VS PROCESSING TIME
[Figure: event time plotted against processing time. Labels: Input, Ideal, Early, Late. Visualization by Frances Perry and Tyler Akidau]

14 EVENT VS PROCESSING TIME: STAR WARS
PROCESSING TIME: Episodes IV, V, VI, I, II, III, VII, VIII, IX. These episodes appeared out-of-order.
EVENT TIME: Episodes I, II, III, IV, V, VI, VII, VIII, IX. That's much better!
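
The same analogy as a small plain-Python sketch: the release order plays the role of processing time, and the episode number plays the role of event time. The `ROMAN` lookup table and the "behind the highest episode seen so far" rule for spotting out-of-order elements are assumptions made for this example.

```python
ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5,
         "VI": 6, "VII": 7, "VIII": 8, "IX": 9}

# Order of arrival (processing time), as listed on the slide.
release_order = ["IV", "V", "VI", "I", "II", "III", "VII", "VIII", "IX"]

# Re-order by event time, i.e. by episode number.
by_event_time = sorted(release_order, key=ROMAN.__getitem__)

# An element is out of order if its event time is behind the
# highest event time seen so far.
out_of_order = []
highest_seen = 0
for episode in release_order:
    if ROMAN[episode] < highest_seen:
        out_of_order.append(episode)
    highest_seen = max(highest_seen, ROMAN[episode])

print(by_event_time)
print(out_of_order)
```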

15 THE BEAM MODEL: A SIMPLE PIPELINE
What, Where, When, How

    PCollection<KV<String, Integer>> scores = input
        .apply(Sum.integersPerKey());


16 A SIMPLE PIPELINE: WINDOWING
What, Where, When, How

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
        .apply(Sum.integersPerKey());

Window Types
- Global Window
- Fixed Time Windows
- Sliding Time Windows
- Per-Session Windows


17 A SIMPLE PIPELINE: TRIGGERING
What, Where, When, How

    // AtWatermark() is slide shorthand; the Beam Java SDK spells it
    // AfterWatermark.pastEndOfWindow().
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()))
        .apply(Sum.integersPerKey());

Triggers
- Event time
- Processing time
- Data-driven
- Composite triggers

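18 A SIMPLE PIPELINE: TRIGGERING
What, Where, When, How

    // AtPeriod(...) and AtCount(...) are slide shorthand; the Beam Java SDK spells them
    // AfterProcessingTime.pastFirstElementInPane().plusDelayOf(...) and
    // AfterPane.elementCountAtLeast(...).
    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                .withLateFirings(AtCount(1))))
        .apply(Sum.integersPerKey());

Triggers
- Event time
- Processing time
- Data-driven
- Composite triggers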

19 A SIMPLE PIPELINE: REFINEMENT
What, Where, When, How

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());
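
What the accumulation mode changes can be sketched in plain Python. The example assumes a single window whose trigger fires three times (early, on time, late) and compares accumulating panes with the alternative of discarding fired panes; `fire_panes` and the sample values are hypothetical.

```python
def fire_panes(batches, accumulating):
    """Return the sum emitted at each trigger firing.

    Each batch holds the values that arrived since the previous firing.
    """
    panes = []
    running_total = 0
    for batch in batches:
        if accumulating:
            running_total += sum(batch)  # refine the previous result
        else:
            running_total = sum(batch)   # report only the new values
        panes.append(running_total)
    return panes


# Values arriving before the early, on-time and late firing of one window.
batches = [[5, 7], [3], [4]]

print(fire_panes(batches, accumulating=True))
print(fire_panes(batches, accumulating=False))
```

With accumulation each pane is a complete, updated answer for the window, so a consumer can simply overwrite the previous value; with discarding the consumer has to add the panes up itself.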

20 JAVA VS PYTHON

Java

    PCollection<KV<String, Integer>> scores = input
        .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
            .triggering(AtWatermark()
                .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
                .withLateFirings(AtCount(1)))
            .accumulatingFiredPanes())
        .apply(Sum.integersPerKey());

Python

    scores = (input
        | WindowInto(FixedWindows(120),
                     trigger=AfterWatermark(
                         early=AfterProcessingTime(60),
                         late=AfterCount(1)),
                     accumulation_mode=AccumulationMode.ACCUMULATING)
        | CombinePerKey(sum))

21 EXECUTE WITH CHOICE OF RUNNER

    input = (pipeline
        | ReadFromText("/path/to/text*")
        | Map(lambda line: ...))

    scores = (input
        | WindowInto(FixedWindows(120),
                     trigger=AfterWatermark(
                         early=AfterProcessingTime(60),
                         late=AfterCount(1)),
                     accumulation_mode=AccumulationMode.ACCUMULATING)
        | CombinePerKey(sum))

    scores | WriteToText("/path/to/outputs")

    MyRunner().run(pipeline)

22 RUNNERS

23 RUNNERS (WIP)
- Direct
- Apache Flink
- Apache Spark
- Apache Apex
- Alibaba JStorm
- Apache Storm
- Apache Samza
- Google Cloud Dataflow
- Apache Gearpump
- IBM Streams
- Hadoop MapReduce

24 PROBLEMS WITH THIS APPROACH
- All execution backends are written in Java
- Can Python run on top of Java?
- N Runners, M languages => N*M translation paths?
- Submission Flow
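
A quick back-of-the-envelope calculation of the N*M concern, using the runners and SDK languages named in this deck. The N+M figure is an assumption: it supposes that every SDK translates once into a shared, language-agnostic pipeline representation and every runner translates once out of it, which is the idea the portability slides that follow develop.

```python
runners = [
    "Direct", "Apache Flink", "Apache Spark", "Apache Apex",
    "Alibaba JStorm", "Apache Storm", "Apache Samza",
    "Google Cloud Dataflow", "Apache Gearpump", "IBM Streams",
    "Hadoop MapReduce",
]
sdks = ["Java", "Python", "Go"]

# Every SDK needs its own translation for every runner.
direct_paths = len(runners) * len(sdks)

# Assumed alternative: one translation per SDK into a shared
# representation, plus one per runner out of it.
portable_paths = len(runners) + len(sdks)

print(direct_paths, portable_paths)
```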

25 PORTABILITY

26 THE BEAM VISION
- Portability in Beam means Pipelines can be written and executed in any supported SDK (Java/Python/Go)
- Pipelines also contain language-specific code (e.g. map/reduce functions)
- Libraries of the language can be used (!)
[Diagram: Beam Java, Beam Go, Beam Python; Pipeline (Runner API); Apache Flink, Cloud Dataflow, Apache Spark; Execution (Fn API)]

27 WITHOUT PORTABILITY
[Diagram, all language-specific: SDK; RUNNER; Backend (e.g. Flink) with TASK 1, TASK 2, TASK 3, ..., TASK N]
All components are tied to a single language

28 WITH PORTABILITY
[Diagram, legend: language-specific / language-agnostic. SDK; Job API; JOB SERVER; Runner API; RUNNER; Portable Job; Backend (e.g. Flink) with TASK 1, TASK 2, TASK 3, ..., TASK N; SDK HARNESS instances attached via the Fn API]

29 WHAT IS THE STATE OF THE BEAM VISION?
- Runners * SDKs is not feasible
- Instead each SDK only implements a Portable Runner
- Flink Runner is the first OSS Runner to work with the Portable Runner
- Cross-language pipelines in the future
- Overhead has been measured to be 5-10%, could be less in real-world scenarios

30 DEMO TIME

31 HOW TO GET INVOLVED
- Visit beam.apache.org
- Documentation
- Examples
- Subscribe to the mailing lists:
- Join the ASF Slack channel #beam
Maximilian Michels, maximilianmichels.com


FROM ZERO TO PORTABILITY? Maximilian Michels mxm@apache.org APACHE BEAM S JOURNEY TO CROSS-LANGUAGE DATA PROCESSING @stadtlegende maximilianmichels.com FOSDEM 2019 What is Beam? What does portability mean?