Portable stateful big data processing in Apache Beam

Size: px

Start display at page:

Download "Portable stateful big data processing in Apache Beam"

Marcia Peters
5 years ago
Views:

1 Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Google klk@google.com Flink Forward San Francisco 2017

2 Agenda 1. What is Apache Beam? 2. State 3. Timers 4. Example & Little Demo

3 What is Apache Beam?

4 TL;DR (Flink draws it more like this) 4

Cloud Dataflow Apache Apex Heron Dataflow Model (paper) Apache Gearpump

5 DAGs, DAGs, DAGs Apache Beam Apache Hadoop MapReduce (paper) Apache Spark FlumeJava (paper) Apache Flink Apache Storm MillWheel (paper) Apache Samza Cloud Dataflow Apache Apex Heron Dataflow Model (paper) Apache Gearpump (incubating)

6 The Beam Vision Java input.apply( Sum.integersPerKey()) Python input Sum.PerKey() Sum Per Key Apache Flink local, on-prem, cloud Cloud Dataflow: fully managed Apache Spark local, on-prem, cloud Apache Apex local, on-prem, cloud Apache Gearpump (incubating) 6

7 The Beam Vision Python input KakaIO.read() KafkaIO Apache Flink local, on-prem, cloud Cloud Dataflow: fully managed Apache Spark local, on-prem, cloud Apache Apex local, on-prem, cloud Java class KafkaIO extends UnboundedSource { } Apache Gearpump (incubating) 7

8 The Beam Model PTransform Pipeline PCollection (bounded or unbounded) 8

9 The Beam Model What are you computing? (read, map, reduce) Where in event time? (event time windowing) When in processing time are results produced? (triggers) How do refinements relate? (accumulation mode) 9

10 What are you computing? Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph 10

11 State

12 What are you computing? Read Parallel connectors to external systems ParDo Per element "Map" Grouping Group by key, Combine per key, "Reduce" Composite Encapsulated subgraph 12

13 State for ParDo ParDo(DoFn) 13

14 "Example" ParDo.of(new DoFn<...>() { // declare some public void process(...) { // update quantiles // output if needed } }) Per Key 14

15 Partitioned 15

16 Parallelism! new DoFn<Foo, <Foo>>() private final StateSpec<...> quantilesstate = StateSpecs.combining(); } DoFn DoFn DoFn 16

17 Windowed Window into Fixed windows of one hour DoFn DoFn DoFn Expected result: for each hour 17

18 Windowed Window into windows of 30 min sliding by 10 min DoFn DoFn DoFn Expected result: for 30 minutes sliding by 10 min 18

19 State is per key and window <k, w> 1 <k, w> 2 <k, w> 3... "x" "y" "fizz" "7" "fizzbuzz"... Bonus: automatically garbage collected when a window expires (vs manual clearing of keyed state)

20 Windowed Window into Fixed windows of one hour DoFn DoFn DoFn Expected result: for each hour 20

21 What about Combine? Window into Fixed windows of one hour Combiner Combiner Combiner Expected result: for each hour 21

22 Combine vs State (naive conceptualization) Window hourly Window hourly Combiner DoFn Expected result: for each hour Expected result: for each hour 22

23 Combine vs State (likely execution plan) Combiner Shuffle Accumulators Combiner Shuffle Elements non-associative* non-commutative* multi-output side outputs (user is in control) associative Combiner commutative single-output enables optimizations (engine is in control) DoFn

24 Combine vs State Combiner DoFn output governed by trigger (data/computation unaware) "output only when there's an interesting change" (data/computation aware)

25 Example: Arbitrary indices E B A C D Assign indices non-associative non-commutative E,4 B,3 non-deterministic and totally fine! A,2 C,1 D,0

26 Kinds of state Value - just a mutable cell for a value Bag - supports "blind writes" Combining - has a CombineFn built in; can support blind writes and lazy accumulation Set - membership checking Map - lookups and partial writes

27 Timers

28 Timers for ParDo new DoFn<...>() private final TimerSpec timeouttimer = TimerSpecs.timer(TimeDomain.PROCESSING_TIME); public void timeout(...) { // access state, set new timers, output } Stateful ParDo 28

29 Timers in processing time buffer request "call me back in 5 seconds" On Element On Timer batched RPC output based only on incoming element output return value of batched RPC

30 Timers in event time "call me back when the watermark hits the end of the window" On Element store speculative result On Timer output speculative result immediately output replacement value only if changed

31 More example uses for state & timers Per-key arbitrary numbering Output only when result changes Tighter "side input" management for slowly changing dimension Streaming join-matrix / join-biclique Fine-grained combine aggregation and output control Per-key "workflows" like user sign up flow w/ expiration Low-latency deduplication (let the first through, squash the rest)

32 Performance considerations (cross-runner) Shuffle to colocate keys Linear processing of elements for key+window Window merging Storage of state and timers GC of state 32

33 Demo 33

34 Summary State and Timers in Beam... unlock new uses cases they "just work" with event time windowing are portable across runners (implementation ongoing)

35 Thank you for listening! This talk: Me These Slides - Go Deeper Design doc - Blog post - Join the Beam community: User discussions - user@beam.apache.org Development discussions - dev@beam.apache.org on Twitter You can contribute to Beam + Flink New types of state Easy launch of Beam job on Flink-on-YARN Integration tests at scale Fit and finish: polish, polish, polish! and lots more!

TOWARDS PORTABILITY AND BEYOND. Maximilian maximilianmichels.com DATA PROCESSING WITH APACHE BEAM

TOWARDS PORTABILITY AND BEYOND Maximilian Michels mxm@apache.org DATA PROCESSING WITH APACHE BEAM @stadtlegende maximilianmichels.com !2 BEAM VISION Write Pipeline Execute SDKs Runners Backends !3 THE