Modern Stream Processing with Apache Flink

Size: px

Start display at page:

Download "Modern Stream Processing with Apache Flink"

Baldwin Rice
5 years ago
Views:

1 1

2 Modern Stream Processing with Apache Flink Till Rohrmann GOTO Berlin

3 Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager 3

4 What changes faster? Data or Query? Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and (hyper) parameter tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, Stream Processing Use Case 4

5 Batch Processing 5

6 Stream Processing 6

7 Apache Flink in a Nutshell 7

8 Apache Flink in a Nutshell Stateful computations over streams real-time and historic fast, scalable, fault tolerant, in-memory, event time, large state, exactly-once Queries Application Applications Devices etc. Streams Historic Data Database Stream File / Object Storage 8

9 The Core Building Blocks Event Streams State (Event) Time Snapshots real-time and hindsight complex business logic consistency with out-of-order data and late data forking / versioning / time-travel 9

10 Powerful Abstractions Layered abstractions to navigate simple to complex use cases High-level Analytics API Stream- & Batch Data Processing Stream SQL / Tables (dynamic tables) DataStream API (streams, windows) val stats = stream.keyby("sensor").timewindow(time.seconds(5)).sum((a, b) -> a.add(b)) Stateful Event- Driven Applications Process Function (events, state, time) def processelement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { } out.collect( ) // emit events state.update( ) // modify state // schedule a timer callback ctx.timerservice.registereventtimetimer(event.timestamp + 500) } 10

11 Hardened at scale Athena X Streaming SQL Platform Service Streaming Platform as a Service 100s jobs, 1000s nodes, TBs state metrics, analytics, real time ML Streaming SQL as a platform Fraud detection Streaming Analytics Platform 11

12 Distributed application infrastructure 12

13 Good old centralized architecture Application Application Application Application $$$ The big mean central database The grumpy DBA 13

14 Modern distributed app. architecture Application Application Sensor Application APIs Application The limit to what you can do is how sophisticated can you compute over the stream! Boils down to: How well do you handle state and time 14

15 A Flink-favored approach 15

16 Event Sourcing + Memory Image event / command periodically snapshot the memory event log persists events (temporarily) main memory update local variables/structures Process 16

17 Event Sourcing + Memory Image Recovery: Restore snapshot and replay events since snapshot event log persists events (temporarily) Process 17

18 Distributed Memory Image Distributed application, many memory images. Snapshots are all consistent together. 18

19 Stateful Event & Stream Processing Scalable embedded state Access at memory speed & scales with parallel operators 19

20 Compute, State, and Storage Classic tiered architecture compute layer Streaming architecture compute + application state database layer application state + backup stream storage and snapshot storage (backup) 20

21 Performance Classic tiered architecture Streaming architecture synchronous reads/writes across tier boundary all modifications are local asynchronous writes of large blobs 21

22 Consistency Classic tiered architecture Streaming architecture exactly once per state =1 =1 distributed transactions at scale typically at-most / at-least once 22

23 Scaling a Service Classic tiered architecture Streaming architecture provision compute provision compute and state together separately provision additional database capacity 23

24 Rolling out a new Service Classic tiered architecture Streaming architecture provision compute and state together provision a new database (or add capacity to an existing one) simply occupies some additional backup space 24

25 What users built on checkpoints Upgrades and Rollbacks Cross Datacenter Failover State Archiving Application Migration Spot Instance Region Chasing A/B testing 25

26 What is the next wave of stream processing applications? 26

27 What changes faster? Data or Query? Data changes slowly compared to fast changing queries ad-hoc queries, data exploration, ML training and tuning Batch Processing Use Case Data changes fast application logic is long-lived continuous applications, data pipelines, standing queries, anomaly detection, ML evaluation, Stream Processing Use Case 27

28 Analytics & Business Logic Source events analytics metrics Business Logic Stream Stream Processor Database Application 28

29 Blending Analytics & Business Logic Source events Stream Streaming Application Stream Processor RPCs (Actor) Messages 29

30 AthenaX by Uber 30

31 Can one build an entire sophisticated web application (say a social network) on a stream processor? (Yes, we can! ) 31

on Kafka/Flink/Elasticsearch/Redis More:

32 @ Social network implemented using event sourcing and CQRS (Command Query Responsibility Segregation) on Kafka/Flink/Elasticsearch/Redis More: 32

33 The next wave of stream processing applications is all types of stateful applications that react to data and time! 33

34 Stateful stream applications Continuous applications versioning, upgrading, rollback, duplicating, migrating, 34

35 The da Platform Architecture Real-time Analytics Anomaly- & Fraud Detection Real-time Data Integration Reactive Microservices (and more) Streams from Kafka, Kinesis, S3, HDFS, Databases,... da Application Manager Application lifecycle management da Platform 2 Apache Flink Stateful stream processing Logging Metrics CI/CD Kubernetes Container platform 35

36 Versioned Applications, not Jobs/Jars Stream Processing Application New Application Version 3 Version 3a upgrade upgrade Version 2 Version 1 fork / duplicate Version 2a Code and Application Snapshot 36

37 Deployments, not Flink Clusters Threat Metrics App. Testing Fraud Detection App. Application Testing Activity Monitor Application Testing / QA Kubernetes Cluster Production Kubernetes Cluster 37

38 Hooks for CI/CD pipelines stateful upgrade trigger CI upgrade API da Application Manager Application Version 12 push update CI Service Kubernetes Cluster 38

39 @dataartisans 39

40 We are hiring! data-artisans.com/careers 40

41 41

The Power of Snapshots Stateful Stream Processing with Apache Flink

The Power of Snapshots Stateful Stream Processing with Apache Flink Stephan Ewen QCon San Francisco, 2017 1 Original creators of Apache Flink da Platform 2 Open Source Apache Flink + da Application Manager