Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

Size: px

Start display at page:

Download "Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_"

Augustus Bailey
6 years ago
Views:

1 Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

2 About Us At GetInData, we build custom Big Data solutions Hadoop, Flink, Spark, Kafka and more Our team is today represented by Adam Kawa Dawid Wysakowicz

3 Have you ever built cool Big Data pipelines?

5 Example Use-Case Can be done in batch and real-time User session analytics at Spotify Simple stats Duration, number of songs, skips, searches etc. Advanced analytics Mood, physical activity, real-time content, ads

6 Example Output _1. Dashboards_ How long do users listen to a new edition of Discover Weekly?

7 Example Output _1. Dashboards 2. Alerts_ How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short!!!

8 Example Output _1. Dashboards 2. Alerts 3. Content_ How long do users listen to a new edition of Discover Weekly? Australian users are listening to Discover Weekly too short!!! Recommend songs and ads based on current activity.

9 1st - Batch Architecture 1h 1h - 1d 1h User Events 1h 1h User Sessions

10 1st - Batch Architecture 1h 1d 1h User Events 1h 1h User Sessions

11 Many Moving Parts The higher learning curve The more gluing code The larger administrative effort The more error-prone solution

12 Long Waiting Time Image source: Continuous Analytics: Stream Query Processing in Practice, Michael J Franklin, Professor, UC Berkley, Dec 2009 and

13 2nd - Micro-Batch Architecture 15m - 1h

14 No Built-In Session Windows [9:00-9:15) [9:15-9:30)

15 No Built-In Session Windows [9:00-9:15) [11:00-12:00)

16 Late Data 14:55-16:45 Event Time Processing Time

17 ... Included in Current Batch 16:55-14:55-16:45 Event Time Processing Time

18 Out-Of-Order Data Event Time Processing Time

19 Out-Of-Order Data Event Time Processing Time

20 Out-Of-Order Data Event Time Processing Time

21 ... Breaks Correctness Event Time Processing Time

22 Problems FILES, BATCHES, DATA LAKES

23 Solving Streaming Problem With Batch?

24 3rd - Streaming-First Architecture

25 User Session Windows 5 Case A Case B Session gap eg. 15 minutes

26 Reading From Kafka val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...))

27 Session Windows With Gap val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId) User 1 User 2

28 Session Windows With Gap val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId).window(EventTimeSessionWindows.withGap(Time.minutes(15))) User 1 Session gap - 15 minutes

29 Analyzing User Session val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId).window(EventTimeSessionWindows.withGap(Time.minutes(15))).apply(new CountSessionStats()) User 1

30 Handling Late Events val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId).window(EventTimeSessionWindows.withGap(Time.minutes(15))).allowedLateness(Time.minutes(60)).apply(new CountSessionStats()) User 1

31 Triggering Early Results val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId).window(EventTimeSessionWindows.withGap(Time.minutes(15))).trigger(EarlyTriggeringTrigger.every(Time.minutes(10))).allowedLateness(Time.minutes(60)).apply(new CountSessionStats()) User 1

32 Sessionization Example val sessionstream : DataStream[SessionStats] = senv.addsource(new KafkaConsumer(...)).keyBy(_.userId).window(EventTimeSessionWindows.withGap(Time.minutes(15))).trigger(EarlyTriggeringTrigger.every(Time.minutes(10))).allowedLateness(Time.minutes(60)).apply(new CountSessionStats()) Working example:

33 Modern Stream Processing Engines Rich stream processing semantic Built-in support for event-time windows Accurate results for late / out-of-order events and replays Early triggers Low latency and high-throughput Exactly-once stateful processing

34 Modern Stream Processing Engines Rich stream processing semantic Built-in support for event-time windows Accurate results for late / out-of-order events and replays Early triggers Low latency and high-throughput Exactly-once stateful processing User survey:

36 How can I reprocess data?

37 Reprocessing Events In Flink Take periodic snapshots of a job It stores Kafka offsets, on-flight sessions, application state 2. Restart a job from a savepoint rather than from a beginning 1.

38 What if data is no longer in Kafka?

39 Consuming Data From HDFS Run your streaming code on HDFS (bounded data) You need to read data in event-time based order Implement mechanism of proper watermark generation

40 What are usual stream processing applications?

41 Stream Analytics Image source:

42 Stream 24/7 Applications

43 What can happen if you stick to batch?

44 Real-Time Personalization Image source: 4-million-users

45 When is batch processing good?

46 Batch Processing Use-Cases Ad-hoc analytics and data exploration Notebooks, Spark/Flink/Hive, Parquet Complete data sets Technical advantages A large swaths of historical data in HDFS High-level libraries in mature batch technologies

47 Batch or Stream? Stream is often a natural representation of data Stream processing is not only about low latency Correctness, expressive API, simplicity Stream processing is a great fit for ETL, KPIs, reports

48 Take-Away Message Stream is often a natural representation of data (when data arrives continuously) Stream processing is not only about low latency Correctness, expressive API, simplicity Stream processing is a great fit for ETL, KPIs, reports don t solve streaming problem with batch jobs

49 Thanks Vilnius!

50 Q&A

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes

Databricks Delta: Bringing Unprecedented Reliability and Performance to Cloud Data Lakes AN UNDER THE HOOD LOOK Databricks Delta, a component of the Databricks Unified Analytics Platform*, is a unified