Apache Beam: A Unified Programming Model for Big Data

1 Apache Beam: A Unified Programming Model for Big Data

Who am I? Jean-Baptiste Onofre <jbonofre@apache.org> <jbonofre@talend.com> @jbonofre http://blog.nanthrax.

2 Who am I?
Jean-Baptiste Onofre (@jbonofre)
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache projects, from system integration & container (Karaf, Camel, ActiveMQ, Archiva, Aries, ServiceMix, ...) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, ...)

3 Apache Beam origin
[Diagram: Google technologies (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) leading to Google Cloud Dataflow and Apache Beam]

4 Beam model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
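
The fourth question, how refinements relate, can be illustrated without any Beam API. The plain-Java sketch below assumes a single window whose trigger fires several times. The `panes` argument is a hypothetical input shape: each inner list holds the values that arrived between two consecutive firings. A discarding mode reports only the new values at each firing, while an accumulating mode reports everything seen so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not the Beam API.
final class PaneSketch {

    // Discarding: each firing emits the sum of the values that arrived since the last firing.
    static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) {
            int sum = 0;
            for (int v : pane) {
                sum += v;
            }
            out.add(sum);
        }
        return out;
    }

    // Accumulating: each firing emits the sum of all values seen so far in the window.
    static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            for (int v : pane) {
                running += v;
            }
            out.add(running);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three firings of the same window (made-up values).
        List<List<Integer>> panes = Arrays.asList(
                Arrays.asList(3, 4), Arrays.asList(5), Arrays.asList(1, 2));
        System.out.println("discarding:   " + discarding(panes));
        System.out.println("accumulating: " + accumulating(panes));
    }
}
```

With a discarding mode a downstream consumer has to add the panes up itself; with an accumulating mode each new pane replaces the previous result.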

5 Customizing What / Where / When / How
1. Classic Batch
2. Windowed Batch
3. Streaming
4. Streaming + Accumulation
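
As a rough illustration of the step from classic batch to windowed batch, the plain-Java sketch below groups a sum by fixed event-time windows. It does not use the Beam API; the `Event` type, the millisecond timestamps, and the 60-second window size are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not the Beam API.
final class FixedWindowSketch {

    // Hypothetical event type: a value and the event time (in ms) at which it occurred.
    static final class Event {
        final long eventTimeMillis;
        final int value;

        Event(long eventTimeMillis, int value) {
            this.eventTimeMillis = eventTimeMillis;
            this.value = value;
        }
    }

    // "Where": start of the fixed window that contains the given event time.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return Math.floorDiv(eventTimeMillis, windowSizeMillis) * windowSizeMillis;
    }

    // "What": sum of values, computed separately for each window.
    static Map<Long, Integer> sumPerWindow(List<Event> events, long windowSizeMillis) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(windowStart(e.eventTimeMillis, windowSizeMillis), e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // The third event is listed last but belongs to the first window by event time.
        List<Event> events = Arrays.asList(
                new Event(1_000L, 5), new Event(61_000L, 7), new Event(59_000L, 3));
        System.out.println(sumPerWindow(events, 60_000L));
    }
}
```

The window is chosen from the event's own timestamp, not from its position in the input, which is why out-of-order data still lands in the right window.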

6 What is Apache Beam?
1. Unified model (batch + stream): What / Where / When / How
2. SDKs (Java, Python, ...) & DSLs (Scala, ...)
3. Runners for existing distributed processing backends (Google Dataflow, Spark, Flink, ...)
4. IOs: data store sources / sinks

7 Apache Beam vision
1. End users: who want to write pipelines in a language that's familiar.
2. SDK/DSL writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
[Diagram: SDKs (Beam Java, Beam Python, other languages) -> Beam Model: Pipeline Construction -> runners (Apache Flink, Cloud Dataflow, Apache Spark) -> Beam Model: Fn Runners -> Execution]

8 Apache Beam - SDKs & DSLs
SDKs: API based on the Beam Model
1. Current:
   a. Java
   b. Python
2. Future (possible) SDKs: Go, Ruby, etc.
DSLs: Domain-Specific Languages based on the Beam Model
1. Current: Scio (Scala API)
2. Future (ideas): Streaming SQL (Calcite), Machine Learning, Complex Event Processing

9 Apache Beam SDK concepts
1. Pipeline - data processing job as a directed graph of transformations
2. PCollection - the data inside a pipeline
3. PTransform - a transformation step in the pipeline
   a. IO transforms - read from a Source or write to a Sink
   b. Core transforms - common transformations provided by the SDK (ParDo, GroupByKey, ...)
   c. Composite transforms - combine multiple transforms
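
The idea of a composite transform can be sketched in plain Java by treating a transform as a function from one immutable collection to a new one. This is only an analogy, not the Beam API: `trim`, `dropEmpty`, and `clean` are hypothetical names, and `java.util.function.Function` stands in for PTransform.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java analogy only; this is not the Beam API.
final class TransformSketch {

    // Element-wise step: trims every element and returns a new list.
    static Function<List<String>, List<String>> trim() {
        return in -> in.stream().map(String::trim).collect(Collectors.toList());
    }

    // Element-wise step: keeps only non-empty elements.
    static Function<List<String>, List<String>> dropEmpty() {
        return in -> in.stream().filter(s -> !s.isEmpty()).collect(Collectors.toList());
    }

    // Composite step: two smaller steps packaged and reused as a single one.
    static Function<List<String>, List<String>> clean() {
        return trim().andThen(dropEmpty());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" a ", "", "b ");
        // The input list is never modified; each step produces a new collection.
        System.out.println(clean().apply(input));
    }
}
```

Callers of `clean()` do not need to know which steps it is built from, which is the main point of composing transforms.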

10 Apache Beam - Pipeline
Data processing pipeline (executed via a Beam runner)
[Diagram: Read PTransform (source) -> PTransform -> PTransform -> Write PTransform (sink)]

11 Apache Beam - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
2. Each element in a PCollection has a timestamp (commonly set by the IO Source)
3. A Coder supports different data serializations
4. Bounded (batch) or unbounded (streaming), depending on the IO Source

12 Apache Beam - PTransform
1. PTransforms are operations that transform data
2. They receive one or multiple PCollections and produce one or multiple PCollections
3. They must be Serializable
4. They should be thread-compatible (if you create your own threads, you must synchronize them)
5. Idempotency is not required but recommended

13 Apache Beam - IO Transforms
1. IOs read/write data as PCollections (Source/Sink)
2. Support bounded and/or unbounded PCollections
3. Extensible API to create custom sources & sinks
4. Deal with timestamps, watermarks, deduplication, and read/write parallelism

14 Agenda
1. Evolution of the Big Data programming models
2. The Beam approach
3. Apache Beam

15 Apache Beam - Current IOs
Ready: File, Avro, Google Cloud Storage, BigQuery, BigTable, DataStore, MQTT, JDBC, Mongo / GridFS, JMS, Kafka, Kinesis
WIP: Hive, Cassandra, Redis, RabbitMQ, ...
Also listed: HDFS, Elasticsearch, HBase

16 Apache Beam - Pipeline with IO Example
public static void main(String[] args) {
    // Create a pipeline parameterized by command-line flags, e.g. --runner
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(KafkaIO.read().withBootstrapServers(servers).withTopics(topics)) // Read input
     .apply(new YourFancyFn()) // Do some processing
     .apply(ElasticsearchIO.write().withAddress(esServer).withIndex(index).withType(type)); // Write output

    // Run the pipeline.
    p.run();
}

17 What are you computing?
Element-Wise
Aggregating
Composite

18 Apache Beam - Programming model in the SDK
Element-wise: ParDo, MapElements, FlatMapElements, Filter, WithKeys, Keys, Values
Grouping: GroupByKey, Combine -> Reduce, Sum, Count, Min, Max, Mean, ...
Windowing/Triggers: FixedWindows, GlobalWindows, SlidingWindows, Sessions, AfterWatermark, AfterProcessingTime, AfterPane, ...
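
Session windows differ from fixed windows in that their boundaries depend on the data. A minimal plain-Java sketch of the idea, assuming each timestamp opens a window of length `gap` and overlapping windows are merged (the method name `sessions` and the array representation are hypothetical; this is not the Beam API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not the Beam API.
final class SessionSketch {

    // Returns merged sessions as {start, end} pairs, where end is exclusive.
    static List<long[]> sessions(long[] timestamps, long gap) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // The new window [t, t + gap) overlaps the current session: extend it.
                last[1] = t + gap;
            } else {
                // No overlap: a new session starts here.
                result.add(new long[] {t, t + gap});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up event times; the input does not need to be ordered.
        for (long[] s : sessions(new long[] {0, 5, 30, 12}, 10)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

A larger gap merges more activity into one session, so the gap duration is the main tuning parameter for this kind of windowing.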

19 Apache Beam - Example - GDELT Events by location
Pipeline pipeline = Pipeline.create(options);

// Read events from a text file and parse them.
pipeline.apply("GDELTFile", TextIO.Read.from(options.getInput()))
    // Extract location from the fields
    .apply("ExtractLocation", ParDo.of(...))
    // Count events per location
    .apply("CountPerLocation", Count.<String>perElement())
    // Reformat KV as a String
    .apply("StringFormat", MapElements.via(...))
    // Write to result files
    .apply("Results", TextIO.Write.to(options.getOutput()));

// Run the batch pipeline.
pipeline.run();
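
To make the extract / count / format steps concrete, here is a stdlib-only Java sketch of the same computation on an in-memory list. It is not the Beam API. It assumes tab-separated input lines and a hypothetical `locationColumn` index, since the slide elides the actual field extraction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration only; this is not the Beam API.
final class CountPerLocationSketch {

    // Extracts one column from each tab-separated line and counts occurrences per value.
    static Map<String, Long> countPerLocation(List<String> lines, int locationColumn) {
        return lines.stream()
                .map(line -> line.split("\t", -1))
                // Skip lines where the column is missing or empty.
                .filter(f -> f.length > locationColumn && !f[locationColumn].isEmpty())
                .map(f -> f[locationColumn])
                .collect(Collectors.groupingBy(
                        Function.identity(), TreeMap::new, Collectors.counting()));
    }

    // Reformats each (location, count) pair as a "location: count" string.
    static List<String> format(Map<String, Long> counts) {
        return counts.entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up rows: an id followed by a location code.
        List<String> lines = Arrays.asList("1\tFR", "2\tUS", "3\tFR", "4\t");
        format(countPerLocation(lines, 1)).forEach(System.out::println);
    }
}
```

In a Beam pipeline the same logic is split into separate named steps so that a runner can distribute each one.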

20 Apache Beam - Runners / Execution Engines
Runners translate the code to a target runtime (the runner itself doesn't provide the runtime).
Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark.
Because of this, runners can be run on-premise (on your local Flink cluster) or in a public cloud (using Google Cloud Dataproc or Amazon EMR), for example.
Apache Beam treats runners as a top-level use case (with APIs, support, etc.), so runners can be developed with minimal friction for maximum pipeline portability.

21 Runners
Apache Beam Direct Runner (local)
Google Cloud Dataflow (managed, NoOps)
Apache Spark
Apache Flink
WIP: Apache Apex, Apache Gearpump, Apache MapReduce, Apache Karaf
Same code, different runners & runtimes

22 Apache Beam - Use cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets.
Batch can focus on ETL/ELT, catch-up processing, daily aggregations, and so on.
Stream can focus on handling real-time processing on a record-by-record basis.
Real use cases:
Data processing, both batch and stream processing
Real-time event processing from IoT devices
Fraud detection, ...

23 Why Apache Beam?
1. Portable - You can use the same code with different runners (agnostic) and backends: on premise, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel

24 Growing the Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors.
Grow - We want to grow the Beam ecosystem and community with active, open involvement so Beam is a part of the larger OSS ecosystem.

25 Learn More!
Apache Beam
Join the Beam mailing lists!
on Twitter

26 Thank You!
