Processing Data Like Google Using the Dataflow/Beam Model

Size: px

Start display at page:

Download "Processing Data Like Google Using the Dataflow/Beam Model"

Silas Griffith
5 years ago
Views:

1 Todd Reedy Google for Work Sales Engineer Google Processing Data Like Google Using the Dataflow/Beam Model

2 Goals: Write interesting computations Run in both batch & streaming Use custom timestamps Handle late data

3 Weave provides the fastest way to connect smart devices to the network, the cloud, and an ecosystem of (business) apps & smart devices. Weave setup works on Android, ios, and desktop. CRM Want more details? Visit:

code. Brillo comes with services to help understand how customers interact with it, and keep it updated over time.

4 Brillo is a free OS, tightly integrated with Android, tailored for IoT. Brillo makes it cost-effective to build a secure connected device, with verified boot to ensure it runs only trusted code. Brillo comes with services to help understand how customers interact with it, and keep it updated over time. Have Brillo running on your device itself or use it as a hub to aggregate data from nearby (micro) processors. OTA Updates Weave Metrics Android HAL Kernel Hardware Telemetry

5 Agenda 1 Data Shapes 2 Google s Data Processing Story 3 The Beam Model 4 Demo: Google Cloud Dataflow

6 1 Data Shapes

7 Data...

8 ...can be big...

9 ...really, really big... Thursday Wednesday Tuesday

10 ...maybe even infinitely big... 8:00 1:00 9:00 2:00 10:00 3:00 11:00 4:00 12:00 5:00 13:00 6:00 14:00 7:00

11 with unknown delays. 8:00 8:00 8:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00

12 Data Processing Tradeoffs = 2 $$$ Completeness Latency Cost

13 Requirements: Billing Pipeline Important Not Important Completeness Latency Cost

14 Requirements: Live Cost Estimate Pipeline Important Not Important Completeness Latency Cost

15 Requirements: Abuse Detection Pipeline Important Not Important Completeness Latency Cost

16 Requirements: Abuse Detection Backfill Pipeline Important Not Important Completeness Latency Cost

17 2 Google s Data Processing Story

18 Google s Data-Related Papers Dataflow MapReduce FlumeJava Dremel GFS Big Table Pregel Spanner Colossus MillWheel

19 MapReduce: Batch Processing M M M R R

20 FlumeJava: Easy and Efficient MapReduce Pipelines Higher-level API with simple data processing abstractions. Focus on what you want to do to your data, not what the underlying system supports. A graph of transformations is automatically transformed into an optimized series of MapReduces.

21 Batch Patterns: Creating Structured Data MapReduce

22 Batch Patterns: Repetitive Runs Thursday Wednesday Tuesday MapReduce

23 Batch Patterns: Time Based Windows Tuesday [11:00-12:00) [12:00-13:00) [13:00-14:00) [14:00-15:00) [15:00-16:00) [16:00-17:00) [18:00-19:00) [19:00-20:00) MapReduce [21:00-22:00) [22:00-23:00) [23:00-0:00)

24 Batch Patterns: Sessions Tuesday Wednesday Tuesday Wednesday Jose Lisa MapReduce Ingo Asha Cheryl Ari

25 MillWheel: Streaming Computations Framework for building low-latency data-processing applications User provides a DAG of computations to be performed System manages state and persistent flow of elements

26 Streaming Patterns: Element-wise transformations 8:00 9:00 10:00 11:00 12:00 13:00 14:00 Processing Time

27 Streaming Patterns: Aggregating Time Based Windows 8:00 9:00 10:00 11:00 12:00 13:00 14:00 Processing Time

28 Streaming Patterns: Event Time Based Windows Input Processing Time 10:00 11:00 12:00 13:00 14:00 15:00 Output Event Time 10:00 11:00 12:00 13:00 14:00 15:00

29 Streaming Patterns: Session Windows Input Processing Time 10:00 11:00 12:00 13:00 14:00 15:00 Output Event Time 10:00 11:00 12:00 13:00 14:00 15:00

30 Processing Time Event-Time Skew Skew Event Time

31 Processing Time Event-Time Skew Watermark Skew Watermarks describe event time progress. "No timestamp earlier than the watermark will be seen" Often heuristic-based. Event Time Too Slow? Results are delayed. Too Fast? Some data is late.

32 Streaming or Batch? = 2 $$$ Completeness Latency Cost Why not both?

33 3 The Dataflow Model

34 3 The Apache Beam Model

35 What are you computing? Where in event time? When in processing time? How do refinements relate?

36 What are you computing? A Pipeline represents a graph of data processing transformations PCollections flow through the pipeline Optimized and executed as a unit for efficiency

37 What are you computing? A PCollection<T> is a collection of data of type T Maybe be bounded or unbounded in size Each element has an implicit timestamp Initially created from backing data stores

38 What are you computing? PTransforms transform PCollections into other PCollections. Element-Wise Aggregating Composite What Where When How

39 Example: Computing Integer Sums // Collection of raw log lines PCollection<String> raw =...; // Element-wise transformation into team/score pairs PCollection<KV<String, Integer>> input = raw.apply(pardo.of(new ParseFn())) // Composite transformation containing an aggregation PCollection<KV<String, Integer>> output = input.apply(sum.integersperkey()); What Where When How

40 Example: Computing Integer Sums What Where When How

41 Example: Computing Integer Sums What Where When How

42 Example: Computing Integer Sums What Where When How

43 Where in Event Time? Windowing divides data into eventtime-based finite chunks Key 1 Key 2 Key Key 1 Key 2 Key Key 1 Key 2 Key Required when doing aggregations over unbounded data. Fixed Sliding Sessions What Where When How

44 Example: Fixed 2-minute Windows PCollection<KV<String, Integer>> output = input.apply(window.into(fixedwindows.of(minutes(2)))).apply(sum.integersperkey()); What Where When How

45 Example: Fixed 2-minute Windows What Where When How

46 Processing Time When in Processing Time? Watermark Triggers control when results are emitted. Event Time Triggers are often relative to the watermark. What Where When How

47 Example: Triggering at the Watermark PCollection<KV<String, Integer>> output = input.apply(window.into(fixedwindows.of(minutes(2))).trigger(atwatermark())).apply(sum.integersperkey()); What Where When How

48 Example: Triggering at the Watermark What Where When How

49 Example: Triggering for Speculative & Late Data PCollection<KV<String, Integer>> output = input.apply(window.into(fixedwindows.of(minutes(2))).trigger(atwatermark().withearlyfirings(atperiod(minutes(1))).withlatefirings(atcount(1)))).apply(sum.integersperkey()); What Where When How

50 Example: Triggering for Speculative & Late Data What Where When How

51 How do Refinements Relate? How should multiple outputs per window accumulate? Appropriate choice depends on consumer. Firing Elements Discarding Speculative 3 Watermark 5, 1 Late 2 Total Observ What Where When How

52 How do Refinements Relate? How should multiple outputs per window accumulate? Appropriate choice depends on consumer. Firing Elements Discarding Accumulating Acc. & Retracting Speculative Watermark 5, , -3 Late , -9 Total Observ What Where When How

53 Example: Add Newest, Remove Previous PCollection<KV<String, Integer>> output = input.apply(window.into(sessions.withgapduration(minutes(1))).trigger(atwatermark().withearlyfirings(atperiod(minutes(1))).withlatefirings(atcount(1))).accumulatingandretracting()).apply(new Sum()); What Where When How

54 Example: Add Newest, Remove Previous What Where When How

55 Customizing What Where When How 1. Classic Batch 2. Batch with Fixed Windows 3. Streaming 4. Streaming with 5. Streaming with Speculative + Late Data Retractions What Where When How

56 4 Demo: Google Cloud Dataflow

57 Google Cloud Dataflow A fully-managed cloud service and programming model for batch and streaming big data processing. Cloud Dataflow

58 Open Source SDKs Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run On your development machine On the Dataflow Service on Google Cloud Platform On third party environments like Spark or Flink.

59 Fully Managed Dataflow Service Runs the pipeline on Google Cloud Platform. Includes: Graph optimization: Modular code, efficient execution Smart Workers: Lifecycle management, Autoscaling, and Smart task rebalancing Easy Monitoring: Dataflow UI, Restful API and CLI, Integration with Cloud Logging, etc.

60 Demo Time!

61 Wrote Reusable code (PTransforms) Built upon the same code (PTransform) in batch and streaming Used event time, not processing time Adjusted windowing and late data settings

62 Learn More! tec2016.toddreedy.com

Google Cloud Dataflow

Google Cloud Dataflow A Unified Model for Batch and Streaming Data Processing Jelena Pjesivac-Grbovic STREAM 2015 Agenda 1 Data Shapes 2 Data Processing Tradeoffs 3 Google s Data Processing Story 4 Google