Data Processing with Apache Beam (incubating) and Google Cloud Dataflow

Size: px

Start display at page:

Download "Data Processing with Apache Beam (incubating) and Google Cloud Dataflow"

Junior Spencer
6 years ago
Views:

1 Data Processing with Apache Beam (incubating) and Google Cloud Dataflow Jelena Pjesivac-Grbovic Staff software engineer Cloud Big Data In collaboration with Frances Perry, Tayler Akidau, and Dataflow team XLDB 16 - May 2016

2 Agenda 1 Infinite, Out-of-Order Data Sets 2 What, Where, When, How 3 Apache Beam (incubating) 4 Google Cloud Dataflow

3 1 Infinite, Out-of-Order Data Sets

4 Data...

5 ...can be big...

6 ...really, really big... Thursday Wednesday Tuesday

7 maybe infinitely big... 8:00 9:00 10:00 11:00 12:00 13:00 14:00

8 with unknown delays. 8:00 8:00 8:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00

9 Element-wise transformations 8:00 9:00 10:00 11:00 12:00 13:00 14:00 Processing Time

10 Aggregating via Processing-Time Windows 8:00 9:00 10:00 11:00 12:00 13:00 14:00 Processing Time

11 Aggregating via Event-Time Windows Input Processing Time 10:00 11:00 12:00 13:00 14:00 15:00 10:00 11:00 12:00 13:00 14:00 15:00 Output Event Time

12 Processing Time Formalizing Event-Time Skew Skew Reality Ideal Event Time

13 Processing Time Formalizing Event-Time Skew Skew Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen" ~Watermark Ideal Often heuristic-based. Event Time Too Slow? Results are delayed. Too Fast? Some data is late.

14 2 What, Where, When, How

15 What are you computing? Where in event time? When in processing time? How do refinements relate?

16 What are you computing? Element-Wise Aggregating Composite What Where When How

17 What: Computing Integer Sums // Collection of raw log lines PCollection<String> raw = IO.read(...); // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(pardo.of(new ParseFn()); // Composite transformation containing an aggregation PCollection<KV<String, Integer>> scores = input.apply(sum.integersperkey()); What Where When How

18 What: Computing Integer Sums What Where When How

19 What: Computing Integer Sums What Where When How

20 Where in event time? Windowing divides data into event-time-based finite chunks. Fixed Sliding Sessions Key 1 Key 2 Key Time Often required when doing aggregations over unbounded data. What Where When How

21 Where: Fixed 2-minute Windows PCollection<KV<String, Integer>> scores = input.apply(window.into(fixedwindows.of(minutes(2))).apply(sum.integersperkey()); What Where When How

22 Where: Fixed 2-minute Windows What Where When How

23 When in processing time? Processing Time Triggers control Skew ~Watermark Ideal when results are emitted. Triggers are often relative to the watermark. Event Time What Where When How

24 When: Triggering at the Watermark PCollection<KV<String, Integer>> scores = input.apply(window.into(fixedwindows.of(minutes(2)).triggering(atwatermark())).apply(sum.integersperkey()); What Where When How

25 When: Triggering at the Watermark What Where When How

26 When: Early and Late Firings PCollection<KV<String, Integer>> scores = input.apply(window.into(fixedwindows.of(minutes(2)).triggering(atwatermark().withearlyfirings(atperiod(minutes(1))).withlatefirings(atcount(1)))).apply(sum.integersperkey()); What Where When How

27 When: Early and Late Firings What Where When How

28 How do refinements relate? How should multiple outputs per window accumulate? Appropriate choice depends on consumer. Firing Elements Discarding Accumulating Acc. & Retracting Speculative [3] Watermark [5, 1] 6 9 9, -3 Late [2] , -9 Last Observed Total Observed (Accumulating & Retracting not yet implemented.) What Where When How

29 How: Add Newest, Remove Previous What Where When How

30 What / Where / When / How 1.Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with Speculative + Late Data 5. Streaming With Retractions 6. Sessions What Where When How

31 3 Apache Beam (incubating)

32 The Evolution of Beam Colossus BigTable PubSub Dremel Google Cloud Dataflow Spanner MapReduce Megastore Millwheel Flume Apache Beam

33 What is Part of Apache Beam? 1. The Beam Model: What / Where / When / How 2. SDKs for writing Beam pipelines -- starting with Java 3. Runners for Existing Distributed Processing Backends Apache Flink (thanks to data Artisans) Apache Spark (thanks to Cloudera) Google Cloud Dataflow (fully managed service) Local (in-process) runner for testing

34 Apache Beam Technical Vision End users: who want to write pipelines or transform libraries in a language that s familiar. SDK writers: who want to make Beam concepts available in new languages. Runner writers: who have a distributed processing environment and want to support Beam pipelines Other Languages Beam Java Beam Python Beam Model: Pipeline Construction Runner A Cloud Dataflow Runner B Beam Model: Fn Runners Execution Execution Execution

35 Growing the Beam Community Collaborate - Beam is becoming a communitydriven effort with participation from many organizations and contributors Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem

36 4 Google Cloud Dataflow

37 Google Cloud Dataflow Fully managed service for running Beam pipelines Dynamically provisioned, on-demand resources VMs, temporary storage No tuning required Autoscaling + Dynamic Work Rebalancing Built from the experience with Google internal products

Without DWR Workers Workers Time With DWR Time For more info

38 No Tuning Required: Dynamic Work Rebalancing Advanced straggler mitigation technique Ensures all tasks finish at the same time Without DWR Workers Workers Time With DWR Time For more info google: No shard left behind: dynamic work rebalancing in Google Cloud Dataflow

39 No Tuning Required: Autoscaling Dynamically adjust to the number of workers to match the load Both for streaming and batch Workers Time Time For more info google: Comparing Cloud Dataflow autoscaling to Spark and Hadoop

Kafka, HDFS, many in flight Part of Google Cloud Platform

40 Integrations Apache Beam connectors Google Cloud Storage, BigQuery, BigTable, Datastore, Pub/Sub, External / Custom IO Kafka, HDFS, many in flight Part of Google Cloud Platform Monitoring UI Cloud Logging Cloud Debugger and Profiler Stackdriver integration

41 Big Data in Google Cloud Platform BigQuery A fast, economical, and fully managed data warehouse solution Dataflow Dataproc Interactive large scale data analysis, exploration and visualization Pub/Sub Fast, easy to use managed Spark and Hadoop service Datalab(beta) Fully managed, real-time, data processing service for batch and streaming Reliable, many-to-many, asynchronous messaging service Genomics Empowers scientists to organize world s genomics information

based on our powerful Vision APIs Speech API Speech to text conversion powered by

42 Machine Learning in Google Cloud Platform Machine Learning Platform(alpha) Fast, large scale, easy to use Machine Learning service Vision API Enables insights based on our powerful Vision APIs Speech API Speech to text conversion powered by Machine Learning Translate API Enables multilingual apps and programmatic translation

43 Learn More! Apache Beam (incubating) Google Cloud Dataflow Google Cloud Platform

44 Thank you!

An Introduction to The Beam Model

An Introduction to The Beam Model Apache Beam (incubating) Slides by Tyler Akidau & Frances Perry, April 2016 Agenda 1 Infinite, Out-of-order Data Sets 2 The Evolution of the Beam Model 3 What, Where,