Let the data flow! Data Streaming & Messaging with Apache Kafka. Frank Pientka, Materna GmbH

1 Let the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka

2 Who is Frank Pientka? Dipl.-Informatiker (TH Karlsruhe). Married, two daughters. Principal Software Architect in Dortmund. Almost 30 years of IT experience. Projects, publications, and talks. More quality in software, networker, innovator. Frank Pientka, Dipl.-Informatiker, frank.pientka@materna.de, +49 (231) (1570)

3 Agenda. The need for speed: fast data. Two worlds: message & data together. Why Kafka? What is Kafka? Cluster, messaging, clients, connecting, streaming. Confluent: use cases, platform. Kafka steps. Summary

4 Big data - fast data

5 Three Vs of Big Data: velocity, variety, volume

6 The data value chain. [Figure: data value versus age of data, for a single data item and for aggregate data. Caption: Close the gap!]

7 The lambda architecture for big data analysis
Data source -> Data queuing
Batch layer (volume): data storage, batch processing
Speed layer (velocity): real-time processing
Serving layer: presentation

8 Kappa architecture for fast data analytics
Data source -> Data queuing
Speed layer (velocity): real-time processing
Serving layer: presentation

9 Big data - fast data: the need for speed. Stream, mini-batch, query, batch

10 What is Kafka? Since 2011: LinkedIn. Apache: 2012. Confluent: 2014. Written in Java & Scala. Kafka 0.11 (streaming): 2017. Kafka 1.1: March 28, 2018. Kafka 1.1.1 and 2.0 planned

12 Messaging with Kafka
Components: broker, Topic A, Topic B, producer, consumer, intermediate topic, state store
Record: key, value, time
Message format: CRC, attributes, key length, key content, message length, message content
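The message format listed on this slide can be made concrete with a small Python sketch. It is an illustration only, not Kafka's actual wire format: the field widths (4-byte CRC, 1-byte attributes, 4-byte lengths) and the function names `encode_message` and `decode_message` are assumptions, and real Kafka records carry more fields (for example a version byte and a timestamp) that differ between releases.

```python
import struct
import zlib


def encode_message(key: bytes, value: bytes, attributes: int = 0) -> bytes:
    """Pack CRC | attributes | key length | key | message length | message.

    Field widths are assumptions made for this sketch.
    """
    body = struct.pack(">b", attributes)
    body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    crc = zlib.crc32(body) & 0xFFFFFFFF  # checksum over everything after the CRC field
    return struct.pack(">I", crc) + body


def decode_message(data: bytes):
    """Verify the CRC, then unpack attributes, key, and value."""
    (crc,) = struct.unpack_from(">I", data, 0)
    body = data[4:]
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: message is corrupt")
    (attributes,) = struct.unpack_from(">b", body, 0)
    (key_len,) = struct.unpack_from(">i", body, 1)
    key = body[5:5 + key_len]
    pos = 5 + key_len
    (value_len,) = struct.unpack_from(">i", body, pos)
    value = body[pos + 4:pos + 4 + value_len]
    return attributes, key, value


message = encode_message(b"user-1", b"login")
print(decode_message(message))  # (0, b'user-1', b'login')
```

The CRC lets a reader detect a corrupted message before trusting the length fields that follow it.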

13 Topics in 3 partitions with 3 replicas. Message order is guaranteed within a partition; messages with the same key go to the same partition.
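Why a key gives per-key ordering can be shown with a short sketch: if the partition is derived from a hash of the key, all messages with that key are appended to the same partition log in send order. The hash used here (`zlib.crc32`) is a stand-in chosen for the sketch, not the hash Kafka's default partitioner uses, and `partition_for` is a hypothetical helper name.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition. crc32 is a stand-in hash for this sketch."""
    return zlib.crc32(key) % num_partitions


# One append-only list per partition, as in a topic with 3 partitions.
partitions = {p: [] for p in range(3)}

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
for key, value in events:
    partitions[partition_for(key.encode(), 3)].append((key, value))

# All "user-1" events sit in one partition, in the order they were sent.
p = partition_for(b"user-1", 3)
print([value for key, value in partitions[p] if key == "user-1"])
```

Across different keys no order is implied: two keys may or may not share a partition.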

14 Distributed partitions (P0-P3), processed in parallel by consumer groups (C1-C6). Each group is split across the partitions to parallelize reads.

15 Consumer groups subscribed to a topic, with parallel reads and rebalancing
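A minimal sketch of how partitions might be divided among the consumers of a group, and what a rebalance changes. The round-robin rule, the helper name `assign`, and the split of C1-C6 into one group of two and one group of four are assumptions for illustration; Kafka ships several assignment strategies and the slide does not say which one is shown.

```python
def assign(partitions, consumers):
    """Round-robin sketch: partition i goes to consumer i mod n."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = ["P0", "P1", "P2", "P3"]

# Each group reads the whole topic; within a group, each partition has one reader.
group_a = assign(partitions, ["C1", "C2"])
group_b = assign(partitions, ["C3", "C4", "C5", "C6"])
print(group_a)  # {'C1': ['P0', 'P2'], 'C2': ['P1', 'P3']}
print(group_b)  # {'C3': ['P0'], 'C4': ['P1'], 'C5': ['P2'], 'C6': ['P3']}

# Rebalancing: if C2 leaves, its partitions are reassigned within the group.
print(assign(partitions, ["C1"]))  # {'C1': ['P0', 'P1', 'P2', 'P3']}
```

A group with more consumers than partitions leaves the surplus consumers idle, which is why the partition count caps read parallelism per group.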

16 Last committed offset, current read position (client offset), high watermark, log end offset
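The four offsets on this slide are ordered, and the gaps between them each mean something. The sketch below models that for one partition; the class and method names are hypothetical, the example numbers are made up, and the ordering assumes a consumer that commits only positions it has already read.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    last_committed_offset: int  # where the group resumes after a restart
    current_position: int       # next offset this consumer will read
    high_watermark: int         # end of the fully replicated log
    log_end_offset: int         # end of the leader's log

    def is_consistent(self) -> bool:
        return (0 <= self.last_committed_offset <= self.current_position
                <= self.high_watermark <= self.log_end_offset)

    def uncommitted(self) -> int:
        """Messages that would be read again if the consumer crashed now."""
        return self.current_position - self.last_committed_offset

    def readable_backlog(self) -> int:
        """Messages already visible to consumers but not yet read."""
        return self.high_watermark - self.current_position

    def not_yet_replicated(self) -> int:
        """Messages written to the leader but not yet visible to consumers."""
        return self.log_end_offset - self.high_watermark


state = PartitionState(last_committed_offset=4, current_position=6,
                       high_watermark=10, log_end_offset=14)
print(state.is_consistent(), state.uncommitted(),
      state.readable_backlog(), state.not_yet_replicated())  # True 2 4 4
```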

17 Producer, consumer, offset, retention period. Messages are retained. Each consumer knows its own position. Horizontal scaling

18 Topics and partitioned log writes in a cluster with horizontal scalability (producers)

19 Log Compaction Basics

20 Log Compaction Basics
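The core idea of log compaction, keeping only the latest record per key, can be sketched in a few lines. This is a simplified model under stated assumptions: the function name `compact` is hypothetical, a value of `None` stands for a delete marker (tombstone), and the whole log is compacted at once. Kafka's real cleaner works on log segments, leaves the active segment alone, and keeps tombstones for a configurable time before dropping them.

```python
def compact(log):
    """Keep the latest record per key; drop keys whose latest value is None.

    `log` is a list of (key, value) pairs; the list index is the offset.
    Surviving records keep their original offsets and relative order.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # a later record overwrites an earlier one
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )


log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", None)]
print(compact(log))  # [(2, 'k1', 'c'), (3, 'k3', 'd')]
```

Offsets 0, 1, and 4 disappear, so a compacted log has gaps in its offsets and no longer contains the full history of a key.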

21 Kafka cluster, single node with multiple brokers: ZooKeeper, producer, consumer groups
Highly scalable, available, and distributed
Diagram labels: producer (streaming); ZooKeeper (port 2181); Kafka broker (port 9092); get cluster topic info; update consumed message offset; Consumer1 (Group1), Consumer2 (Group1), Consumer3 (Group2), Consumer4 (Group2); queue topology, topic topology
Benefits and costs: scalability (size and speed) for big/fast data; availability (distribution, backpressure?); message ordering; retention

22 Kafka consistency and failover with leader and follower replicas
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 3 --topic MultiBrokerTopic

23 Kafka consistency and failover from broker 1 to 2
> bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093,localhost:9094 --topic MultiBrokerTopic

24 Ecosystem

25 Kafka Connect: source & sink connectors. Data source -> Kafka -> data sink. Examples: console, file, JDBC, Elasticsearch, HDFS, S3, DynamoDB

26 Kafka Connectors
CONNECTOR           TYPE
Elasticsearch       sink
HDFS                sink
Amazon S3           sink
Cassandra           sink
Oracle CDC          source
MongoDB             source
MQTT                source
JMS                 sink
Couchbase           sink & source
DynamoDB            sink & source
IBM MQ              sink & source
JDBC                sink & source
Blockchain          source
Amazon Kinesis      sink
CoAP                source
Azure DocumentDB    sink
Splunk              sink & source
Solr                sink & source

27 Kafka stream processing (API, KSQL): create a STREAM or TABLE from a Kafka topic with KSQL

28 Create a KStream or KTable from a topic; a KTable is a changelog stream
Stream-table duality:
- Stream as table: a stream is the changelog of a table; aggregating stream data returns a table
- Table as stream: a table is a snapshot of the latest value per key in a stream of (key, value) records
Example records: ("kafka", 1), ("kafka", 2). Sum of values as KStream: 3 (every record counts). Sum of values as KTable: 2 (the second record updates the first).
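The duality and the two sums on this slide can be reproduced with plain Python dictionaries. This is a sketch of the concept, not the Kafka Streams API; `to_table` and `running_sum` are hypothetical helper names.

```python
def to_table(changelog):
    """Stream as table: replay a changelog; the latest value per key wins."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


def running_sum(stream):
    """Aggregate a record stream per key.

    Yields the changelog of the result table: one update per input record.
    """
    totals = {}
    for key, value in stream:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


records = [("kafka", 1), ("kafka", 2)]

# Read as a KStream, each record is an independent fact, so the values add up.
print(list(running_sum(records)))      # [('kafka', 1), ('kafka', 3)]
print(to_table(running_sum(records)))  # {'kafka': 3}

# Read as a KTable, the second record is an update that replaces the first.
print(to_table(records))               # {'kafka': 2}
```

The aggregation result is itself a changelog, which is why an aggregated stream can be materialized as a table and a table can be emitted again as a stream.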

29 Kafka Streams supports three kinds of joins

30 Operations on KStream & KTable. Tumbling window vs. hopping window. State store type: RocksDB or in-memory; stores are backed by internal compacted changelog topics
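The difference between tumbling and hopping windows comes down to how many windows a single timestamp falls into. The sketch below computes that directly; the function names are hypothetical, time is an integer in arbitrary units, windows are treated as half-open intervals [start, end) aligned to time 0, and window starts are clamped at 0.

```python
def tumbling_window(ts: int, size: int):
    """A tumbling window: fixed size, no overlap, so exactly one window per timestamp."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, advance: int):
    """Hopping windows: fixed size, a new window starts every `advance` units.

    With advance < size the windows overlap, so a timestamp can fall into several.
    """
    windows = []
    start = ts - ts % advance  # latest window start that is <= ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)


print(tumbling_window(7, size=10))             # (0, 10)
print(hopping_windows(7, size=10, advance=5))  # [(0, 10), (5, 15)]
print(hopping_windows(12, size=10, advance=5)) # [(5, 15), (10, 20)]
```

With `advance` equal to `size`, the hopping case reduces to the tumbling one.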

31 State in cluster stream processing

32 Event store: reconstruct the original table from the changelog stream. Don't use log compaction with KStreams! It breaks the event store.

33 Kappa architecture with Kafka Streams & Kafka Connect
Data source -> Input_topic
Speed layer (core + streams): stream processing, job n, job n+1
Serving layer (connect): Output_table n, Output_table n+1

34 Publish & subscribe: read and write streams of data like a messaging system.
Process: write scalable stream processing applications that react to events in real time.
Store: store streams of data safely in a distributed, replicated, fault-tolerant cluster.

35 Let's get our hands dirty

36 Create/list topics
Create a topic:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List all topics:
> bin/kafka-topics.sh --list --zookeeper localhost:2181
Output: test

37 Producer: send some messages
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
Now type on the console:
This is a message
This is another message

38 Consumer: receive some messages
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message

39 Cluster
> cp config/server.properties config/server-93.properties
Edit config/server-93.properties:
broker.id=93
listeners=PLAINTEXT://:9093
log.dir=/tmp/kafka-logs-93
Now start another Kafka server and create a topic with replication factor 2 (= number of brokers):
> bin/kafka-server-start.sh config/server-93.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic MultiBrokerTopic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic MultiBrokerTopic
> bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9093 --topic MultiBrokerTopic
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092,localhost:9093 --from-beginning --topic MultiBrokerTopic
Kill the leader: the leader switches from broker ID 93 to broker ID 0

40 Connect
connect-file-source.properties:
file=test.txt
topic=connect-test
connect-file-sink.properties:
file=test.sink.txt
topics=connect-test
> echo -e "hello\nworld" > test.txt
> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties
> more test.sink.txt
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"hello"}
{"schema":{"type":"string","optional":false},"payload":"world"}
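The console consumer output on this slide shows that the connector wraps each line of the file in a JSON envelope with a `schema` and a `payload`. A downstream reader might unwrap it roughly as follows; the helper name `unwrap` and its validation rules are assumptions for this sketch and cover only the string schema shown on the slide.

```python
import json

# The two records as printed by the console consumer on the slide.
lines = [
    '{"schema":{"type":"string","optional":false},"payload":"hello"}',
    '{"schema":{"type":"string","optional":false},"payload":"world"}',
]


def unwrap(line: str):
    """Return the payload of one envelope after a basic check against its schema."""
    envelope = json.loads(line)
    schema, payload = envelope["schema"], envelope["payload"]
    if payload is None:
        if not schema.get("optional", False):
            raise ValueError("payload is missing but the schema is not optional")
        return None
    if schema["type"] == "string" and not isinstance(payload, str):
        raise ValueError("payload does not match the declared string type")
    return payload


print([unwrap(line) for line in lines])  # ['hello', 'world']
```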

41 Use cases for Apache Kafka (Confluent)

42 Confluent Platform: open source & commercial

43 Summary: Kafka
- Best of both worlds: distributed, highly scalable messaging & streaming
- Extensible platform with many connectors and supported programming languages
- Stream processing is a fast-growing topic with promising solutions
- Lack of standards
- Basic authorization and security mechanisms
- Production challenges (e.g. monitoring, debugging, sizing in the cloud, containers)
- Growing experience and best practices
- Professional support
- Managed cloud solutions

44 Further information: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Tyler Akidau et al., VLDB

45 More questions?

46 Contact: Materna GmbH, Frank Pientka, Tel
