The Stream Processor as a Database. Ufuk

1 The Stream Processor as a Database Ufuk

2 Realtime Counts and Aggregates: The (Classic) Use Case

3 (Real-)Time Series Statistics
[Diagram: Stream of Events -> Real-time Statistics]

4 The Architecture
[Diagram: collect -> message queue -> analyze -> serve & store]

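5 The Flink Job

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09

    case class Impressions(id: String, impressions: Long)

    // `env` and `Event` are defined elsewhere in the job; the consumer
    // arguments are elided on the slide.
    val events: DataStream[Event] =
      env.addSource(new FlinkKafkaConsumer09[Event](/* ... */))

    // Keep only impression events and project them onto (id, count).
    val impressions: DataStream[Impressions] = events
      .filter(evt => evt.isImpression)
      .map(evt => Impressions(evt.id, evt.numImpressions))

    // Sum impressions per id over one-hour windows.
    val counts: DataStream[Impressions] = impressions
      .keyBy("id")
      .timeWindow(Time.hours(1))
      .sum("impressions")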
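
What the keyed one-hour window sum computes can be illustrated without a cluster. The following is a plain-Scala sketch, not Flink code: the `Timed` wrapper, the millisecond timestamps, and the `aggregate` helper are assumptions made for illustration, and tumbling windows aligned to the epoch are assumed.

```scala
// Plain-Scala model of keyBy("id").timeWindow(Time.hours(1)).sum("impressions").
// Not Flink code: all helper names below are illustrative assumptions.
object HourlyCounts {
  final case class Impressions(id: String, impressions: Long)

  // An impression record with an event timestamp in milliseconds (hypothetical).
  final case class Timed(timestampMs: Long, value: Impressions)

  val HourMs: Long = 60L * 60L * 1000L

  // Start of the one-hour tumbling window containing a non-negative timestamp.
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // Group by (id, window start) and sum the impressions in each group.
  def aggregate(events: Seq[Timed]): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.value.id, windowStart(e.timestampMs)))
      .map { case (key, group) => key -> group.map(_.value.impressions).sum }
}
```

Each `(id, window start)` pair yields one aggregate, which corresponds to the per-key, per-window state that the streaming job maintains incrementally instead of recomputing from a batch.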

10 The Flink Job
[Diagram: two parallel pipelines, each Kafka Source -> filter() -> map() -> keyBy() -> window()/sum() -> Sink, each with its own State]

11 Putting it all together
Periodically (every second), flush new aggregates to Redis.

12 The Bottleneck
Writes to the key/value store take too long.

13 Queryable State

15 Queryable State
Optional, and only at the end of windows

16 Queryable State: Application View
[Diagram: the Application talks to a Query Service. Current time windows (realtime results) are served from the stream processor's state; past time windows (older results) are served from the Database.]
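
The split between current and past windows could be routed roughly as follows. This is a minimal sketch under stated assumptions: `WindowKey`, `Lookup`, `MapLookup`, and `Router` are hypothetical names, and deciding by window end time is one possible rule, not something the slides specify.

```scala
object QueryService {
  // Key of one aggregate: the entity id plus the start of its time window.
  final case class WindowKey(id: String, windowStartMs: Long)

  // Hypothetical read-only view of a backend (live state or database).
  trait Lookup { def get(key: WindowKey): Option[Long] }

  // In-memory stand-in used here for both backends.
  final class MapLookup(data: Map[WindowKey, Long]) extends Lookup {
    def get(key: WindowKey): Option[Long] = data.get(key)
  }

  // Route by window age: windows that are still open are answered from the
  // stream processor's state; closed windows are answered from the database.
  final class Router(liveState: Lookup, database: Lookup, windowSizeMs: Long) {
    def query(key: WindowKey, nowMs: Long): Option[Long] = {
      val windowIsOpen = nowMs < key.windowStartMs + windowSizeMs
      if (windowIsOpen) liveState.get(key) else database.get(key)
    }
  }
}
```

With this shape the application sees a single query interface, and the database only needs to receive a window's result once that window has closed.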

17 Queryable State Enablers
- Flink has state as a first-class citizen
- State is fault tolerant (exactly-once semantics)
- State is partitioned (sharded) together with the operators that create/update it
- State is continuous (not mini-batched)
- State is scalable

18 State in Flink
- Events flow without replication or synchronous writes
- State index (e.g., RocksDB)
- Events are persistent and ordered (per partition / key) in the message queue (e.g., Apache Kafka)
[Diagram: Source / filter() / map() -> window()/sum()]

19 State in Flink
- Trigger checkpoint
- Inject checkpoint barrier
[Diagram: Source / filter() / map() -> window()/sum()]

20 State in Flink
- Take state snapshot
- Trigger state copy-on-write
[Diagram: Source / filter() / map() -> window()/sum()]

21 State in Flink
- Persist state snapshots
- Processing pipeline continues
- Durably persist snapshots asynchronously
[Diagram: Source / filter() / map() -> window()/sum()]
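
The barrier-and-snapshot sequence on the preceding slides can be modeled in a few lines. This is a plain-Scala sketch, not Flink's checkpointing implementation: the `Message`, `Record`, and `Barrier` types and the `run` function are assumptions, and an immutable map stands in for copy-on-write state.

```scala
object CheckpointSketch {
  // Input to the stateful operator: data records interleaved with barriers.
  sealed trait Message
  final case class Record(key: String, n: Long) extends Message
  final case class Barrier(checkpointId: Long) extends Message

  type State = Map[String, Long]

  // Returns (final state, snapshots by checkpoint id).
  // A snapshot is just a reference to the immutable map at barrier time;
  // later updates create new versions and never modify the snapshot.
  def run(input: Seq[Message]): (State, Map[Long, State]) =
    input.foldLeft((Map.empty[String, Long], Map.empty[Long, State])) {
      case ((state, snapshots), Record(key, n)) =>
        (state.updated(key, state.getOrElse(key, 0L) + n), snapshots)
      case ((state, snapshots), Barrier(id)) =>
        (state, snapshots.updated(id, state))
    }
}
```

Because the snapshot is unaffected by later records, it could be written to durable storage in the background while the operator keeps processing, which is the property the slides describe.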

22 Queryable State: Implementation
Query: /job/state-name/key
(1) Query Client asks the State Location Server for the location of the "key-partition" of "job"
(2) State Location Server looks up the location (ExecutionGraph, deploy status)
(3) State Location Server responds with the location
(4) Query Client queries the State Registry with state-name and key
[Diagram: the Job Manager hosts the State Location Server and the ExecutionGraph; each Task Manager hosts a State Registry, in which window()/sum() registers its local state]
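
The four-step lookup can be sketched as a toy model. None of the names below are Flink's client API; they mirror the labels on the slide, and assigning keys to partitions by hash code is an assumption made so that the example is self-contained.

```scala
object QueryableStateSketch {
  // Hypothetical address of a task manager.
  final case class Location(taskManager: String)

  // Task manager side: local state instances registered under a state name.
  final class StateRegistry {
    private var states = Map.empty[String, Map[String, Long]]

    def register(stateName: String, state: Map[String, Long]): Unit =
      states = states.updated(stateName, state)

    def query(stateName: String, key: String): Option[Long] =
      states.get(stateName).flatMap(_.get(key))
  }

  // Job manager side: maps (job, key partition) to the task manager holding it.
  final class StateLocationServer(
      locations: Map[(String, Int), Location],
      numPartitions: Int) {

    // Assumed partitioning rule: non-negative hash modulo partition count.
    def partitionOf(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

    def lookup(job: String, key: String): Option[Location] =
      locations.get((job, partitionOf(key)))
  }

  // Client side: resolves a query of the form /job/state-name/key.
  final class QueryClient(
      server: StateLocationServer,
      registries: Map[Location, StateRegistry]) {

    def query(job: String, stateName: String, key: String): Option[Long] =
      for {
        location <- server.lookup(job, key)         // steps (1) to (3)
        registry <- registries.get(location)
        value    <- registry.query(stateName, key)  // step (4)
      } yield value
  }
}
```

The location lookup is separate from the state query, so a client could cache locations and contact the owning task manager directly on later queries.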

23 Queryable State Performance

24 Conclusion

25 Takeaways
- Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck.
- Queryable state mitigates a big bottleneck: communication with external key/value stores to publish realtime results.
- Apache Flink's sophisticated support for state makes this possible.

26 Takeaways: Performance of Queryable State
- Data persistence is fast with logs: append only, and streaming replication
- Computed state is fast with local data structures and no synchronous replication
- Flink's checkpoint method makes computed state persistent with low overhead

27 Questions?
Code/Demo:

28 Appendix

29 Flink Runtime + APIs
- Table API & Stream SQL
- DataStream API
- ProcessFunction API
- Runtime: Distributed Streaming Data Flow
- Building Blocks: Streams, Time, State

30 Apache Flink Architecture Review
