Down the event-driven road: Experiences of integrating streaming into analytic data platforms


1 Down the event-driven road: Experiences of integrating streaming into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Meetup Munich,

2 Integrate existing (batch) data sources? Check consistency with data sources? Build realtime data visualizations?

3 Down the event-driven road..
Analytic (streaming) data platforms
Integrating existing (batch) data sources
Checking consistency
Building realtime visualizations
Wrap up & summary

4 A typical analytic data platform
Layers (data flow): ingress, raw, processed, datahub, analysis, egress
Inputs: flat files, databases, APIs, ...
Processing: batch processing (Spark, Hive, ...)
Storage: (Hive) tables
User access, system integration, development: SQL, notebooks (Zeppelin, ...)
Scheduling, orchestration, metadata: Airflow, Hive Metastore

5 A typical (?) streaming data platform
Layers (data flow): ingress, raw, processed, datahub, analysis, egress
Inputs: input data (streams)
Processing: stream processing (Kafka Streams, Nifi, ...), Kafka Connect
Storage: (Kafka) topics, KTables, ...
User access, system integration, development: KSQL
Scheduling, orchestration, metadata: (Confluent) Schema Registry


7 Integrating web tracking
company website -> tracking pixel -> tracking service -> raw tracking data

8 Integrating web tracking: setup / constraints
Hortonworks-based platform, including Nifi and Confluent Platform
Apache Airflow: established scheduling / workflow tool, integrated into monitoring, alerting, ...
Tracking service: currently a batch-oriented API (request data, get download links, ...), but a click event stream is planned
Developers / analysts with mixed backgrounds w.r.t. programming skills

9 Apache Nifi in a nutshell
Drag-and-drop visual definition of data pipelines
Various built-in connectors (file, stream, database, service, ...)
Event-based processing paradigm
Built-in queues, data provenance, backpressure handling, registry, ...
Focus: ingest & lightweight (!) transformation
Not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
Integrated into the HDP stack

10 Apache Airflow in a nutshell
Python library to define & schedule batch workflows
Programmatic specification of a DAG (= tasks + dependencies)
Clean handling of job run metadata (success, duration, ...)
Developed by Airbnb, open-sourced in 2015
Built-in standard operators (bash, hive, spark, kubernetes, ...)
Easily extensible (custom operators, ...)
Once used -> never Oozie again :)
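
The "DAG = tasks + dependencies" idea and the run metadata it produces can be sketched without Airflow itself. The following is a library-free illustration only (Python 3.9+ for `graphlib`), not Airflow's API: the task names, the `run_dag` helper and the status strings are assumptions made for this example.

```python
import time
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run callables in dependency order and record per-task run metadata."""
    runs = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        # A task is skipped if any task it depends on did not succeed.
        if any(runs[up]["status"] != "success" for up in upstream):
            runs[name] = {"status": "upstream_failed", "duration": 0.0}
            continue
        start = time.monotonic()
        try:
            tasks[name]()
            status = "success"
        except Exception:
            status = "failed"
        runs[name] = {"status": status, "duration": time.monotonic() - start}
    return runs


def failing_download():
    raise RuntimeError("download link expired")  # simulated failure


# Hypothetical task names; each key maps to the set of tasks it depends on.
dependencies = {
    "fetch_links": set(),
    "download": {"fetch_links"},
    "load_to_hive": {"download"},
}
tasks = {
    "fetch_links": lambda: None,
    "download": failing_download,
    "load_to_hive": lambda: None,
}

runs = run_dag(tasks, dependencies)
for name, meta in runs.items():
    print(name, meta["status"])
```

Keeping status and duration per task is what makes reloading a single failed step cheap: only `download` and its downstream task need to be rerun.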

11 Integrating web tracking: options
Option: Airflow only
+ integrated into monitoring, ...
+ job status handling, reloading
- not prepared for future stream API
- handling file content complicated
Option: Unified abstraction (e.g. Apache Beam)
+ one model for batch / stream ingest
- comparatively high entry barrier
Option: Nifi only
+ visual pipeline definition
+ easy handling of file content
+ event-based paradigm
+ operators available
- custom status handling, reloading
Option: Kafka-Connect
+ fault-tolerant
+ scalable setup
- custom connector coding
- custom status handling, reloading

12 Integrating web tracking: chosen solution
Airflow + Nifi
Airflow: trigger the tracking service, fetch download links, check status (sensors), trigger (hourly) download
Nifi: download, process, store data
Combines advantages of Airflow & Nifi
Prepared for future streaming API
Integrated into monitoring, alerting, ...
Status handling / reloading easy
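
The trigger / check status / hand over sequence can be sketched as a small polling loop. This is a minimal illustration under stated assumptions: `FakeTrackingService`, its three methods, the example URL and the `hand_over` callback are all hypothetical stand-ins, since the real tracking API and the Nifi hand-over mechanism are not described in the slides.

```python
import time


class FakeTrackingService:
    """Stand-in for the batch-oriented tracking API (hypothetical interface)."""

    def __init__(self, ready_after_polls):
        self._polls_left = ready_after_polls

    def request_export(self, hour):
        return f"export-{hour}"

    def export_status(self, export_id):
        self._polls_left -= 1
        return "ready" if self._polls_left <= 0 else "pending"

    def download_links(self, export_id):
        return [f"https://tracking.example/{export_id}/part-0.csv.gz"]


def wait_until_ready(service, export_id, poke_interval=0.0, max_pokes=10):
    """Sensor-style polling: True once the export is ready, False on timeout."""
    for _ in range(max_pokes):
        if service.export_status(export_id) == "ready":
            return True
        time.sleep(poke_interval)
    return False


def hourly_run(service, hour, hand_over):
    """Scheduler side of one run: trigger, wait, then pass each link on."""
    export_id = service.request_export(hour)
    if not wait_until_ready(service, export_id):
        raise TimeoutError(f"{export_id} not ready in time")
    links = service.download_links(export_id)
    for link in links:
        hand_over(link)  # in the chosen setup, the download itself happens in Nifi
    return links


handed_over = []
links = hourly_run(FakeTrackingService(ready_after_polls=3), "2018-10-01T11", handed_over.append)
print(links)
```

The scheduler only moves small control messages (export ids, links); file content stays on the event-based side, which is what keeps a later switch to a streaming API cheap.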


14 Checking consistency: Customer Consent
Customer grants / revokes consent in the customer portal
Customer portal stores consent in the customer (consent) database
Consent event -> Kafka -> writes consent to Hive
Customer (consent) database and Hive: in sync?

15 Checking consistency: setup / constraints
Analysts need an up-to-date version of customer consent information in the platform
Hard correctness requirements (especially regarding revoked consent)
Continuous monitoring of correctness
Alerting in case of differences

16 Checking Consistency: Statistics Events
Events over time, customer portal -> Kafka:
{type:grant, cid:12, ts:2018-10-01 11:00:00..}
{type:grant, cid:10, ts:2018-10-01 11:01:00..}
{type:revok, cid:09, ts:2018-10-01 11:01:05..}
{type=stat, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252,..} }
Use the existing channel (Kafka)
Source injects periodic statistics events into the stream, each with a defined measure point (in time)
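
How the source side might derive such a statistics event can be sketched as follows. This is an illustration only: the `version` field on grant events and the reading of `num_consent_v1` / `num_consent_v2` as "currently granted consents per consent version" are assumptions, as the slides do not spell out the event schema.

```python
def build_stat_event(events, measure_ts):
    """Summarise the consent state at measure_ts as a statistics event."""
    latest = {}
    # ISO-style timestamps sort correctly as plain strings.
    for ev in sorted(events, key=lambda e: e["ts"]):
        if ev["ts"] <= measure_ts:
            latest[ev["cid"]] = ev  # the last event per customer wins
    stats = {}
    for ev in latest.values():
        if ev["type"] == "grant":
            key = f"num_consent_v{ev['version']}"  # 'version' is an assumed field
            stats[key] = stats.get(key, 0) + 1
    return {"type": "stat", "measure_ts": measure_ts, "stats": stats}


events = [
    {"type": "grant", "cid": 9, "version": 1, "ts": "2018-10-01 10:00:00"},
    {"type": "grant", "cid": 12, "version": 1, "ts": "2018-10-01 11:00:00"},
    {"type": "grant", "cid": 10, "version": 2, "ts": "2018-10-01 11:01:00"},
    {"type": "revok", "cid": 9, "ts": "2018-10-01 11:01:05"},
    {"type": "grant", "cid": 11, "version": 2, "ts": "2018-10-01 11:02:00"},
]

stat_event = build_stat_event(events, "2018-10-01 11:01:20")
print(stat_event)
```

The event from customer 11 arrives after the measure point and is therefore ignored; the fixed measure point is what makes the number reproducible on the target side.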

17 Checking Consistency: Evaluate Statistics Event
Source side (customer (consent) database): {type=stat, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252,..} }
Target side (Hive): { measure_ts=2018-10-01 11:01:20, hive_stats={ num_consent_v1:72625, num_consent_v2: 6252,..} }
In sync?
Perform count on target side (Hive) up to $measurepoint
Compare counts
Counts = simple plausibility check, but more elaborate checks (hashes) are conceivable
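
The comparison step itself is small. The sketch below assumes the target-side counts have already been computed (for example by a Hive query restricted to rows up to the measure point) and are available as a plain dictionary; the function names and the alert message format are hypothetical.

```python
def compare_stats(stat_event, hive_stats):
    """Return {measure: (source_count, target_count)} for every mismatch."""
    source = stat_event["stats"]
    diffs = {}
    # Union of keys, so a measure missing on either side is reported too.
    for measure in sorted(set(source) | set(hive_stats)):
        src, tgt = source.get(measure, 0), hive_stats.get(measure, 0)
        if src != tgt:
            diffs[measure] = (src, tgt)
    return diffs


def alert_messages(stat_event, hive_stats):
    """One human-readable line per differing measure; empty list if in sync."""
    return [
        f"{stat_event['measure_ts']}: {measure} source={src} hive={tgt}"
        for measure, (src, tgt) in compare_stats(stat_event, hive_stats).items()
    ]


stat_event = {
    "type": "stat",
    "measure_ts": "2018-10-01 11:01:20",
    "stats": {"num_consent_v1": 72625, "num_consent_v2": 6252},
}

in_sync = alert_messages(stat_event, {"num_consent_v1": 72625, "num_consent_v2": 6252})
# Hypothetical mismatch: one v2 consent is missing on the target side.
out_of_sync = alert_messages(stat_event, {"num_consent_v1": 72625, "num_consent_v2": 6251})
print(in_sync, out_of_sync)
```

Swapping the counts for per-partition hashes would keep the same compare-and-alert structure while catching errors that leave the totals unchanged.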


19 Realtime visualizations: Online Shop Purchases
online shop -> purchase event (JMS) -> normalization, filtering, aggregation, ... -> realtime dashboard

20 Realtime visualizations: setup / constraints
Goal: timely insights into various purchase aspects (items bought in the last 5 min, ...)
Flexible / configurable frontend (time window, aggregation dimension, ...)
Scalable to 100s / 1000s of dashboard users
Low latency of dashboard backend
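
What the backend has to answer for each dashboard request is a windowed aggregation with a configurable window and dimension. The following in-memory sketch only illustrates that query shape; the event fields (`ts` as epoch seconds, `item`, `category`, `quantity`) are assumptions, and a real backend would delegate this to the data store rather than scan a list.

```python
from collections import Counter


def aggregate_purchases(events, now, window_seconds=300, dimension="item"):
    """Sum quantities per dimension value for events in (now - window, now]."""
    start = now - window_seconds
    totals = Counter()
    for ev in events:
        if start < ev["ts"] <= now:
            totals[ev[dimension]] += ev["quantity"]
    return dict(totals)


# Hypothetical purchase events; 'ts' is seconds since the epoch.
purchases = [
    {"ts": 1000, "item": "socks", "category": "clothing", "quantity": 2},
    {"ts": 1200, "item": "mug", "category": "kitchen", "quantity": 1},
    {"ts": 1290, "item": "socks", "category": "clothing", "quantity": 3},
    {"ts": 1295, "item": "scarf", "category": "clothing", "quantity": 1},
]

last_5_min = aggregate_purchases(purchases, now=1300)
by_category = aggregate_purchases(purchases, now=1300, window_seconds=60, dimension="category")
print(last_5_min, by_category)
```

Because window and dimension are request parameters, pre-aggregating during processing fixes them too early; this is one argument for aggregating at query time in the store.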

21 Realtime visualizations: components / options
Layer: option 1 | option 2 | option 3
Service API: Spring Boot + Phoenix / JDBC (aggregation at query-time) | Spring Boot + Phoenix / JDBC | Spring Boot (built-in, configurable aggregation)
Service backend: HBase | HBase | Druid
Transport layer: Kafka-connect | Kafka-connect | Tranquility
Processing: Kafka, Kafka-streams, Kafka-connect, Nifi, JMS (aggregation during processing)

22 Realtime visualizations: chosen solution
Pipeline: JMS -> Nifi -> Kafka -> Tranquility -> Druid -> Spring Boot
Druid: time series database with focus on realtime ingestion (good Kafka integration), slice-and-dice queries, distributed scale-out architecture
Event processing kept simple in Nifi: mainly cleaning, transformation
Aggregation is pushed down to Druid
But: yet another distributed system.. :(
Experiences good so far, but needs work / skills


24 The human factor..
Technology moves from batch to stream: what about people?
Analysts' world = often batch world
Tooling centered around static datasets
These can (and must) be generated from streams
But: education towards stream / event-based thinking necessary!
Incremental / stream-based data exchange = paradigm shift
Efforts / commitment from both ends necessary

25 Stream me up, Scotty..
The future is event-based, but on the way:
Existing batch-oriented APIs: use (scheduled) event-based tools for easier later migration
Checking consistency: inject plausibility checks into the data stream
Realtime visualizations: Druid + Kafka is a powerful and flexible combination
Don't forget the human in the loop!

26 Thank you
Dr. Dominik Benz, inovex GmbH, Park Plaza, Ludwig-Erhard-Allee, Karlsruhe
