Using Apache Beam for Batch, Streaming, and Everything in Between. Dan Halperin (@dhalperi), Apache Beam PMC; Senior Software Engineer, Google

Abstract Apache Beam is a unified programming model capable of expressing a wide variety of both traditional batch and complex streaming use cases. By neatly separating properties of the data from run-time characteristics, Beam enables users to easily tune requirements around completeness and latency and run the same pipeline across multiple runtime environments. In addition, Beam's model enables cutting edge optimizations, like dynamic work rebalancing and autoscaling, giving those runtimes the ability to be highly efficient. This talk will cover the basics of Apache Beam, touch on its evolution, and describe the main concepts in its powerful programming model. We'll include detailed, concrete examples of how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios.

Apache Beam: open-source data processing APIs. Expresses data-parallel batch and streaming algorithms with one unified API. Cleanly separates data processing logic from runtime requirements. Supports execution on multiple distributed processing runtime environments. Integrates with the larger data processing ecosystem.

Announcing the First Stable Release

Apache Beam at this conference:
Using Apache Beam for Batch, Streaming, and Everything in Between - Dan Halperin @ 10:15 am
Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways - Davor Bonaci and Jean-Baptiste Onofré @ 11:15 am
Concrete Big Data Use Cases Implemented with Apache Beam - Jean-Baptiste Onofré @ 12:15 pm
Nexmark, a Unified Framework to Evaluate Big Data Processing Systems - Ismaël Mejía and Etienne Chauchot @ 2:30 pm

Apache Beam at this conference:
Apache Beam Birds of a Feather - Wednesday, 6:30 pm - 7:30 pm
Apache Beam Hacking Time - all day Thursday, 2nd floor collaboration area (depending on interest)

This talk: Apache Beam introduction and update

The Beam Model: asking the right questions. What results are calculated? Where in event time are results calculated? When in processing time are results materialized? How do refinements of results relate?
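The four questions map directly onto the pipeline-construction API. Below is a minimal sketch (not from the talk) of the correspondence in the Java SDK, assuming a hypothetical input PCollection<KV<String, Long>> of per-key values:

    // imports: org.apache.beam.sdk.transforms.Sum,
    //          org.apache.beam.sdk.transforms.windowing.*, org.joda.time.Duration
    PCollection<KV<String, Long>> results = input
        // What is calculated? -- the transforms, here a per-key sum.
        .apply(Window.<KV<String, Long>>into(
                // Where in event time? -- the windowing, here fixed 1-minute windows.
                FixedWindows.of(Duration.standardMinutes(1)))
            // When in processing time? -- the triggers: at the watermark,
            // with speculative early firings and one firing per late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardSeconds(30)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardMinutes(10))
            // How do refinements relate? -- the accumulation mode: each new
            // pane includes (accumulates) the data from previous firings.
            .accumulatingFiredPanes())
        .apply(Sum.longsPerKey());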

1. Classic Batch 2. Batch with Fixed Windows 3. Sessions 4. Streaming 5. Streaming with Speculative + Late Data

What is Apache Beam? [Architecture diagram:] The Beam programming model (What / Where / When / How). SDKs for writing Beam pipelines (Java, Python, other languages), which target the Beam model's pipeline-construction APIs. Beam runners for existing distributed processing backends (Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, Google Cloud Dataflow), which execute pipelines via the Beam model's Fn runner APIs.

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Simple clickstream analysis pipeline. Data: a JSON-encoded analytics stream from the site, e.g. { "user": "dhalperi", "page": "apache.org/feed/7", "tstamp": "2016-08-31T15:07Z", ... }. Desired output: per-user session length and activity level, e.g. "dhalperi, 33 pageviews, 2016-08-31 15:04-15:25". [Figure: events on an event-time axis from 3:00 to 3:25, forming one session, 3:04-3:25.]

Two example applications, over the same input: a clickstream from 2 million users over 1 day.

Streaming job consuming a Kafka stream: uses 10 workers; pipeline lag of a few seconds. Goal: fresh, correct results at low latency; okay to use more resources.

Batch job consuming an HDFS archive: uses 200 workers; runs for 30 minutes. Goal: accurate results at job completion; batch efficiency.

What does the user have to change to get these results? A: O(10 lines of code) + command-line arguments.
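The command-line-arguments half of that answer is standard Beam: pipeline options select the runner at launch time. A hedged sketch (the flags shown are the standard Beam ones; the chosen runner must be on the classpath):

    // imports: org.apache.beam.sdk.Pipeline, org.apache.beam.sdk.options.*
    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      Pipeline pipeline = Pipeline.create(options);
      // ... build the clickstream pipeline shown later in this talk ...
      pipeline.run();
    }
    // Streaming run:  --runner=FlinkRunner --streaming=true
    // Batch run:      --runner=SparkRunner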

Quick overview of the Beam model. Clean abstractions hide details.
PCollection - a parallel collection of timestamped elements that are in windows.
Sources & Readers - produce PCollections of timestamped elements and a watermark.
ParDo - a flatmap over the elements of a PCollection.
(Co)GroupByKey - shuffle & group: {{K: V}} becomes {K: [V]}.
Side inputs - a global view of a PCollection, used for broadcast / joins.
Window - reassign elements to zero or more windows; may be data-dependent.
Triggers - user flow control based on window, watermark, element count, and lateness.
State & Timers - cross-element data storage and callbacks that enable complex operations.
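As an illustration of the first few abstractions, here is a hedged sketch of ParDo and GroupByKey in the Java SDK; lines is a hypothetical PCollection<String> of JSON records, and Click and ParseClick are hypothetical helpers:

    // imports: org.apache.beam.sdk.transforms.*, org.apache.beam.sdk.values.*
    PCollection<KV<String, Iterable<Click>>> grouped = lines
        // ParDo: a flatmap -- each input element may emit zero or more outputs.
        .apply(ParDo.of(new DoFn<String, KV<String, Click>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            Click click = ParseClick.fromJson(c.element());  // hypothetical parser
            c.output(KV.of(click.user(), click));
          }
        }))
        // GroupByKey: shuffle {K: V} pairs into {K: [V]}.
        .apply(GroupByKey.<String, Click>create());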

1. Classic Batch 2. Batch with Fixed Windows 3. Sessions 4. Streaming 5. Streaming with Speculative + Late Data

Simple clickstream analysis pipeline

    PCollection<KV<User, Click>> clickstream =
        pipeline.apply(IO.Read(...))
                .apply(MapElements.of(new ParseClicksAndAssignUser()));

    PCollection<KV<User, Long>> userSessions =
        clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                                .triggering(AtWatermark()
                                    .withEarlyFirings(AtPeriod(Minutes(1)))))
                   .apply(Count.perKey());

    userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
                .apply(IO.Write(...));

    pipeline.run();

Unified unbounded & bounded PCollections

    pipeline.apply(IO.Read(...))
            .apply(MapElements.of(new ParseClicksAndAssignUser()));

Apache Kafka, ActiveMQ, a tailed filesystem, ...: a live, roughly in-order stream of messages, yielding unbounded PCollections:

    KafkaIO.read().fromTopic("pageviews")

HDFS, Google Cloud Storage, yesterday's Kafka log, ...: archival data, often readable in any order, yielding bounded PCollections:

    TextIO.read().from("hdfs://apache.org/pageviews/*")
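For reference, the released Java SDK spells these reads slightly differently than the slide's shorthand; a hedged sketch, with the broker address and topic purely illustrative:

    // Unbounded read from Kafka (org.apache.beam.sdk.io.kafka.KafkaIO):
    PCollection<String> pageviews = pipeline
        .apply(KafkaIO.<Long, String>read()
            .withBootstrapServers("broker:9092")          // assumed address
            .withTopic("pageviews")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())            // drop Kafka metadata, keep KV pairs
        .apply(Values.<String>create());   // keep just the message payloads

    // Bounded read of archival text files (org.apache.beam.sdk.io.TextIO):
    PCollection<String> archived = pipeline
        .apply(TextIO.read().from("hdfs://apache.org/pageviews/*"));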

Windowing and triggers

    PCollection<KV<User, Long>> userSessions =
        clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                                .triggering(AtWatermark()
                                    .withEarlyFirings(AtPeriod(Minutes(1)))));

[Figure: the same events plotted in processing time vs. event time (3:00-3:25), with the watermark advancing. The early-firing trigger emits speculative results roughly once a minute: first "(EARLY) 1 session, 3:04-3:10", then "(EARLY) 2 sessions, 3:04-3:10 & 3:15-3:20". Once the watermark passes the end of the session gap, the final, complete result is emitted: one session, 3:04-3:25.]
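The trigger spelling above is the talk's shorthand; in the released Java SDK, the same sessions-with-early-firings logic reads roughly as follows (a sketch, with String standing in for the User type):

    PCollection<KV<String, Long>> userSessions = clickstream
        .apply(Window.<KV<String, Click>>into(
                Sessions.withGapDuration(Duration.standardMinutes(3)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1))))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
        .apply(Count.perKey());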

Writing output, and bundling

    userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
                .apply(IO.Write(...));

Writing is the dual of reading: format, then output. Fault-tolerant side effects: exploit sink semantics to get effectively-once delivery, via deterministic operations, idempotent operations (create, delete, set), transactions / unique operation IDs, or within-pipeline state.
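One way to read the "unique operation IDs" point: derive a deterministic ID from the key and window, so a retried bundle rewrites the same record instead of appending a duplicate. A hedged sketch; SessionStore is a hypothetical key-value sink with "set" semantics:

    class WriteSessionFn extends DoFn<KV<String, Long>, Void> {
      @ProcessElement
      public void processElement(ProcessContext c, BoundedWindow window) {
        // Deterministic ID: the same element + window always maps to the same row.
        String id = c.element().getKey() + ":" + window.maxTimestamp().getMillis();
        SessionStore.put(id, c.element().getValue());  // idempotent "set"
      }
    }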

Two example runs of this pipeline.

Streaming job consuming the Kafka stream (2 million users over 1 day): uses 10 workers; pipeline lag of a few seconds; a total of ~4.7M early + final sessions; 240 worker-hours.

Batch job consuming the HDFS archive: uses 200 workers; runs for 30 minutes over the same input; a total of ~2.1M final sessions; 100 worker-hours.

With Apache Beam, the same pipeline works for both: just switch the I/O.

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Beam abstractions empower runners: efficiency at the runner's discretion.

"Read from this source, splitting it 1000 ways": the user decides, via trial and error at small scale, and hopes that it works.

"Read from this source" (e.g. hdfs://logs/*): the runner decides, via two APIs:

    long getEstimatedSize()
    List<Source> split(size)

The runner asks the source for its size (say, 50 TiB), weighs execution-time factors (cluster utilization, quota, bandwidth, reservations, throughput, bottlenecks), chooses a split (say, 1 TiB per bundle), and the source returns the resulting sub-sources (e.g. lists of filenames).
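Those two calls correspond (approximately) to methods on BoundedSource in the Java SDK, so a custom source gets runner-chosen splitting without the user pre-tuning shard counts. A sketch, with the bookkeeping helpers left abstract and hypothetical:

    // org.apache.beam.sdk.io.BoundedSource
    abstract class LogSource extends BoundedSource<String> {
      @Override
      public long getEstimatedSizeBytes(PipelineOptions options) throws Exception {
        return estimateBytes();  // e.g. sum of matching file lengths
      }

      @Override
      public List<? extends BoundedSource<String>> split(
          long desiredBundleSizeBytes, PipelineOptions options) throws Exception {
        // The runner picks desiredBundleSizeBytes; return ~size/bundleSize sub-sources.
        return splitIntoSubSources(desiredBundleSizeBytes);
      }

      abstract long estimateBytes();                                        // hypothetical
      abstract List<BoundedSource<String>> splitIntoSubSources(long bytes); // hypothetical
    }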

Bundling and runner efficiency. A bundle is a group of elements processed and committed together, as one transaction.

APIs (ParDo/DoFn):

    startBundle()
    processElement()  // n times
    finishBundle()

Streaming runner: small bundles, low-latency pipelining across stages, overhead of frequent commits.
Classic batch runner: large bundles, fewer large commits, more efficient, long synchronous stages.
Other runner strategies strike a different balance.
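In the Java SDK, the bundle lifecycle surfaces as DoFn annotations. A hedged sketch that buffers writes and flushes once per bundle; Client is an assumed external service, not a Beam API:

    class BatchingWriteFn extends DoFn<String, Void> {
      private transient List<String> buffer;

      @StartBundle
      public void startBundle() {
        buffer = new ArrayList<>();   // fresh buffer per bundle
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        buffer.add(c.element());
      }

      @FinishBundle
      public void finishBundle() {
        Client.writeAll(buffer);      // committed together with the bundle
      }
    }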

Runner-controlled triggering. Beam triggers are flow control, not instructions: they say "it is okay to produce data", not "produce data now". Runners decide when to produce data, and can make local choices for efficiency. Streaming clickstream analysis: the runner may optimize for latency and freshness; small bundles and frequent triggering mean more files and more (speculative) records. Batch clickstream analysis: the runner may optimize for throughput and efficiency; large bundles and no early triggering mean fewer, larger files and no early records.

Pipeline workload varies. [Figure: workload over time.] A streaming pipeline's input rate varies over time; batch pipelines go through stages.

Perils of fixed decisions. [Figure: a fixed allocation against a varying workload.] Provision for the worst case and you are over-provisioned most of the time; provision for the average case and you are under-provisioned at peaks.

Ideal case. [Figure: allocation tracks the workload over time.]

The straggler problem. Work is unevenly distributed across tasks. Reasons: the underlying data, processing, runtime effects. Effects are cumulative per stage. [Figure: per-worker timelines; a few tasks run far longer than the rest.]

Standard workarounds for stragglers: split files into equal sizes? Preemptively over-split? Detect slow workers and re-execute? Sample extensively and then split? All of these have major costs, and none is a complete solution.

No amount of upfront heuristic tuning (be it manual or automatic) is enough to guarantee good performance: the system will always hit unpredictable situations at run-time. A system that's able to dynamically adapt and get out of a bad situation is much more powerful than one that heuristically hopes to avoid getting into it.

Beam readers enable dynamic adaptation. Readers provide simple progress signals, enabling runners to take action based on execution-time characteristics.

APIs for how much work is pending:

    Bounded:   double getFractionConsumed()
    Unbounded: long getBacklogBytes()

Work-stealing:

    Bounded: Source splitAtFraction(double)
             int getParallelismRemaining()
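Roughly how those hooks look on a bounded reader (BoundedSource.BoundedReader in the Java SDK); a sketch with the range bookkeeping left abstract:

    abstract class LogReader extends BoundedSource.BoundedReader<String> {
      private long bytesRead;   // updated by advance(), not shown
      private long totalBytes;  // fixed when the reader is created

      @Override
      public Double getFractionConsumed() {
        // Cheap progress signal the runner can poll to spot stragglers.
        return totalBytes == 0 ? null : bytesRead / (double) totalBytes;
      }

      @Override
      public BoundedSource<String> splitAtFraction(double fraction) {
        // Hand the unread remainder back to the runner as a new source,
        // shrinking this reader's range; null means "cannot split here".
        return residualAsSource(fraction);  // hypothetical helper
      }

      abstract BoundedSource<String> residualAsSource(double fraction);
    }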

Dynamic work rebalancing. [Figure: per-task timelines of done work, active work, and predicted completion. Without rebalancing, most tasks finish near the average while a few stragglers run long past it; with rebalancing, the runner repeatedly splits the stragglers' remaining work and reassigns it to idle workers, so all tasks complete at about the same time.]

Dynamic work rebalancing: a real example. A Beam pipeline on the Google Cloud Dataflow runner: a 2-stage pipeline, split evenly up front but uneven in practice, versus the same pipeline with dynamic work rebalancing enabled. [Figure: the rebalanced run completes noticeably sooner; the difference is the savings.]

Dynamic work rebalancing + autoscaling. A Beam pipeline on the Google Cloud Dataflow runner: initially allocates ~80 workers based on input size; multiple rounds of upsizing, enabled by dynamic splitting, upscale to 1000 workers, with tasks staying well-balanced and without over-splitting initially; long-running tasks are aborted without causing stragglers.

Apache Beam is a unified programming model designed to provide efficient and portable data processing pipelines

Write a pipeline once, run it anywhere: a unified model for portable execution.

    PCollection<KV<User, Click>> clickstream =
        pipeline.apply(IO.Read(...))
                .apply(MapElements.of(new ParseClicksAndAssignUser()));

    PCollection<KV<User, Long>> userSessions =
        clickstream.apply(Window.into(Sessions.withGapDuration(Minutes(3)))
                                .triggering(AtWatermark()
                                    .withEarlyFirings(AtPeriod(Minutes(1)))))
                   .apply(Count.perKey());

    userSessions.apply(MapElements.of(new FormatSessionsForOutput()))
                .apply(IO.Write(...));

    pipeline.run();

Get involved with Beam.
If you analyze data: try it out!
If you have a data storage or messaging system: write a connector!
If you have Big Data APIs: write a Beam SDK, DSL, or library!
If you have a distributed processing backend: write a Beam runner! (and join Apache Apex, Flink, Spark, Gearpump, and Google Cloud Dataflow)


Learn more.
Apache Beam website: https://beam.apache.org/ (documentation, mailing lists, downloads, etc.)
Read our blog posts: Streaming 101 & 102 (oreilly.com/ideas/the-world-beyond-batch-streaming-101) and "No shard left behind: Dynamic work rebalancing in Google Cloud Dataflow".
Apache Beam blog: http://beam.apache.org/blog/
Follow @ApacheBeam on Twitter.