Fast Data apps with Alpakka Kafka connector and Akka Streams. Sean Glover,

1 Fast Data apps with Alpakka Kafka connector and Akka Streams Sean Glover,

2 Who am I?
I'm Sean Glover
  Principal Engineer at Lightbend
  Member of the Fast Data Platform team
  Organizer of Scala Toronto (scalator)
  Contributor to various projects in the Kafka ecosystem, including Kafka, Alpakka Kafka (reactive-kafka), Strimzi, and DC/OS Commons SDK
  seg1o

3 The Alpakka project is an initiative to implement a library of integration modules to build stream-aware, reactive pipelines for Java and Scala.

4 [Diagram labels: JMS, Cloud Services, Data Stores, Messaging]

5 Alpakka Kafka connector
The Alpakka Kafka connector lets you connect Apache Kafka to Akka Streams. It was formerly known as Akka Streams Kafka and even Reactive Kafka.

6 Top Alpakka Modules Alpakka Module Downloads in August 2018 Kafka Cassandra AWS S MQTT File Simple Codecs 8285 CSV 7428 AWS SQS 5385 AMQP

7 Akka Streams
Akka Streams is a library toolkit that provides low-latency, complex event processing streaming semantics using the Reactive Streams specification, implemented internally with an Akka actor system.

8 Akka Streams
[Diagram: Source -> Flow -> Sink, each outlet connected to the next inlet. User messages flow downstream; internal back-pressure messages flow upstream.]

9 Reactive Streams Specification
Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure.

10 Reactive Streams Libraries
[Diagram: library logos, including Akka Streams; one is labelled "migrating to"]
Spec now part of JDK 9: java.util.concurrent.Flow

11 Back-pressure
[Diagram: Source Kafka Topic -> Source -> Flow -> Sink -> Destination Kafka Topic]
Callouts: "I need some messages." / "I need to load some messages for downstream."
A demand request is sent upstream. Demand is satisfied downstream.
Example messages shown in the topics:
  Key: EN, Value: { "message": "Hi Akka!" }
  Key: FR, Value: { "message": "Salut Akka!" }
  Key: ES, Value: { "message": "Hola Akka!" }
  Key: EN, Value: { "message": "Bye Akka!" }
  Key: FR, Value: { "message": "Au revoir Akka!" }
  Key: ES, Value: { "message": "Adiós Akka!" }

12 Dynamic Push Pull
[Diagram: a fast producer (Source) feeding a slow consumer (Flow) that has a bounded mailbox]
Flow: "I can handle 5 more messages." Flow sends a demand request (pull) of 5 messages max.
Source sends (push) a batch of 5 messages downstream.
Flow's mailbox is full!
Source: "I can't send more messages downstream because I have no more demand to fulfill."
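
The push/pull exchange on this slide can be sketched in plain Scala without Akka. This is only an illustrative model: the names `Producer`, `Consumer`, `push`, `demand` and `processOne` are hypothetical and are not Akka Streams APIs. The consumer's demand is taken to be the number of free slots in its bounded mailbox, and the producer never emits more than that.

```scala
import scala.collection.mutable

object DynamicPushPull {
  // Hypothetical fast producer: it emits only as many elements as downstream has asked for.
  final class Producer(private var pending: List[Int]) {
    def push(demand: Int): List[Int] = {
      val (batch, rest) = pending.splitAt(demand)
      pending = rest
      batch
    }
    def remaining: Int = pending.size
  }

  // Hypothetical slow consumer with a bounded mailbox.
  final class Consumer(capacity: Int) {
    private val mailbox = mutable.Queue.empty[Int]

    // Demand (pull) is the number of free mailbox slots.
    def demand: Int = capacity - mailbox.size

    def receive(batch: List[Int]): Unit = {
      require(batch.size <= demand, "producer sent more than the signalled demand")
      mailbox ++= batch
    }

    // Processing one element frees one slot, which becomes new demand.
    def processOne(): Option[Int] =
      if (mailbox.isEmpty) None else Some(mailbox.dequeue())
  }
}
```

With a mailbox of 5, the first pull moves 5 elements; after that the producer is handed a demand of 0 and emits nothing until the consumer processes an element.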

13 Akka Streams Factorial Example
    import ...

    object Main extends App {
      implicit val system = ActorSystem("QuickStart")
      implicit val materializer = ActorMaterializer()

      // Emit the integers 1 to 100
      val source: Source[Int, NotUsed] = Source(1 to 100)

      // Running product: emits 1, then 1!, 2!, 3!, ...
      val factorials = source.scan(BigInt(1))((acc, next) => acc * next)

      // Write one factorial per line to a file
      val result: Future[IOResult] =
        factorials
          .map(num => ByteString(s"$num\n"))
          .runWith(FileIO.toPath(Paths.get("factorials.txt")))
    }

14 Kafka
Kafka is a distributed streaming system. It's best suited to support fast, high-volume, and fault-tolerant data streaming platforms. (Kafka Documentation)

15 When to use Alpakka Kafka?
1. To build back-pressure aware integrations
2. Complex Event Processing
3. A need to model the most complex of graphs

16 Alpakka Kafka Setup
    // Alpakka Kafka config & Kafka client config can go here
    val consumerClientConfig = system.settings.config.getConfig("akka.kafka.consumer")
    val consumerSettings =
      ConsumerSettings(consumerClientConfig, new StringDeserializer, new ByteArrayDeserializer)
        .withBootstrapServers("localhost:9092")
        .withGroupId("group1")
        // Set ad-hoc Kafka client config
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

    val producerClientConfig = system.settings.config.getConfig("akka.kafka.producer")
    val producerSettings =
      ProducerSettings(system, new StringSerializer, new ByteArraySerializer)
        .withBootstrapServers("localhost:9092")

17 Simple Consume, Transform, Produce Workflow
    val control =
      Consumer
        // Committable source provides Kafka offset storage committing semantics
        .committableSource(consumerSettings, Subscriptions.topics("topic1", "topic2")) // Kafka consumer subscription
        // Transform and produce a new message with reference to offset of consumed message
        .map { msg =>
          // Create ProducerMessage with reference to the consumer offset it was processed from
          ProducerMessage.Message[String, Array[Byte], ConsumerMessage.CommittableOffset](
            new ProducerRecord("targettopic", msg.record.value),
            msg.committableOffset
          )
        }
        // Produce ProducerMessage and automatically commit the consumed message once it's been acknowledged
        .toMat(Producer.commitableSink(producerSettings))(Keep.both)
        .mapMaterializedValue(DrainingControl.apply)
        .run()

    // Add shutdown hook to respond to SIGTERM and gracefully shut down the stream
    sys.ShutdownHookThread {
      Await.result(control.shutdown(), 10.seconds)
    }

18 Consumer Groups

19 Why use Consumer Groups?
1. Easy, robust, and performant scaling of consumers to reduce consumer lag

20 Latency and Offset Lag
[Diagram: Producers 1 to n write to a topic in the cluster; a consumer group of Consumers 1 to 3 reads from it and applies back pressure]
Data throughput: 10 MB/s
Consumer throughput: ~3 MB/s each, ~9 MB/s total
Offset lag and latency are growing.

21 Latency and Offset Lag
[Diagram: the same setup with Consumer 4 added to the consumer group]
Add new consumer and rebalance.
Data throughput: 10 MB/s
Consumers can now support a throughput of ~12 MB/s.
Offset lag and latency decrease until consumers are caught up.
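
The arithmetic behind these two slides can be written down as a small sketch. The function names and the 600 MB backlog used in the usage note are assumptions made for illustration; the 10 MB/s and ~3 MB/s figures are the ones from the slides. It also assumes every consumer in the group actually has partitions assigned, since an extra consumer with no partition adds no throughput.

```scala
object OffsetLag {
  // Net change in lag in MB/s: positive means lag grows, negative means consumers are catching up.
  def lagGrowthMBps(producedMBps: Double, consumers: Int, perConsumerMBps: Double): Double =
    producedMBps - consumers * perConsumerMBps

  // Seconds needed to drain an existing backlog, or None if consumers never catch up.
  def secondsToCatchUp(
      backlogMB: Double,
      producedMBps: Double,
      consumers: Int,
      perConsumerMBps: Double
  ): Option[Double] = {
    val growth = lagGrowthMBps(producedMBps, consumers, perConsumerMBps)
    if (growth < 0) Some(backlogMB / -growth) else None
  }
}
```

With three consumers the lag grows by about 1 MB/s; with four it shrinks by about 2 MB/s, so a hypothetical 600 MB backlog would take roughly 300 seconds to clear.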

22 Anatomy of a Consumer Group
[Diagram: a consumer group with Client A (consumer group leader, partitions 0,1,2), Client B (partitions 3,4,5) and Client C (partitions 6,7,8) subscribed to topics T1, T2 and T3 in the cluster. The cluster hosts the consumer group coordinator and the consumer group offsets topic.]
Important consumer group client config
Topic subscription: Subscriptions.topics("Topic1", "Topic2", "Topic3")
Kafka consumer properties:
  group.id: [my-group]
  session.timeout.ms: [30000 ms]
  partition.assignment.strategy: [RangeAssignor]
  heartbeat.interval.ms: [3000 ms]
Consumer group offsets topic, ex) consumer offset log with one entry per partition: P0: P1: P2: P3: P4: P5: P6: P7: P8:

23 Consumer Group Rebalance (1/7)
[Diagram: consumer group with Client A (consumer group leader, partitions 0,1,2), Client B (partitions 3,4,5) and Client C (partitions 6,7,8) reading topics T1, T2 and T3; the cluster hosts the consumer group coordinator and the consumer offset log]

24 Consumer Group Rebalance (2/7)
Client D requests to join the consumer group.
New Client D with the same group.id sends a request to join the group to the coordinator.

25 Consumer Group Rebalance (3/7)
Consumer group coordinator requests group leader to calculate new Client:Partition assignments.

26 Consumer Group Rebalance (4/7)
Consumer group leader sends new Client:Partition assignment to group coordinator.

27 Consumer Group Rebalance (5/7)
Consumer group coordinator informs all clients of their new Client:Partition assignments.
[Diagram: Assign partitions 0,1 / Assign partitions 2,3 / Assign partitions 4,5 / Assign partitions 6,7,8]

28 Consumer Group Rebalance (6/7)
Clients that had partitions revoked are given the chance to commit their latest processed offsets.
[Diagram: Partitions to commit: 2 / Partitions to commit: 4,5 / Partitions to commit: 6,7,8]

29 Consumer Group Rebalance (7/7)
Rebalance complete. Clients begin consuming partitions from their last committed offsets.
[Diagram: Client A partitions 0,1; Client B partitions 2,3; Client C partitions 4,5; Client D partitions 6,7,8]
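
The "partitions to commit" in step 6 follow from comparing each client's assignment before and after the rebalance. The sketch below is a plain-Scala illustration of that comparison, not part of the Kafka client or Alpakka Kafka; `RebalanceDiff` and `revoked` are hypothetical names, and the assignments are the ones read from the diagrams in steps 1 and 7.

```scala
object RebalanceDiff {
  type Assignment = Map[String, Set[Int]]

  // Partitions each client loses in the rebalance: these are the partitions
  // whose processed offsets the client should commit before giving them up.
  def revoked(before: Assignment, after: Assignment): Map[String, Set[Int]] =
    before
      .map { case (client, parts) => client -> (parts -- after.getOrElse(client, Set.empty[Int])) }
      .filter { case (_, parts) => parts.nonEmpty }

  // Assignments as shown in steps 1/7 and 7/7.
  val before: Assignment = Map("A" -> Set(0, 1, 2), "B" -> Set(3, 4, 5), "C" -> Set(6, 7, 8))
  val after: Assignment =
    Map("A" -> Set(0, 1), "B" -> Set(2, 3), "C" -> Set(4, 5), "D" -> Set(6, 7, 8))
}
```

For these assignments Client A gives up partition 2, Client B gives up 4 and 5, and Client C gives up 6, 7 and 8; Client D has nothing to commit because it is new.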

30 Commit on Consumer Group Rebalance
    val consumerClientConfig = system.settings.config.getConfig("akka.kafka.consumer")
    val consumerSettings =
      ConsumerSettings(consumerClientConfig, new StringDeserializer, new ByteArrayDeserializer)
        .withGroupId("group1")

    // Declare a RebalanceListener Actor to handle assigned and revoked partitions
    class RebalanceListener extends Actor with ActorLogging {
      def receive: Receive = {
        case TopicPartitionsAssigned(sub, assigned) =>
        case TopicPartitionsRevoked(sub, revoked) =>
          // Commit offsets for messages processed from revoked partitions
          commitProcessedMessages(revoked)
      }
    }

    // Assign RebalanceListener to topic subscription
    val subscription = Subscriptions.topics("topic1", "topic2")
      .withRebalanceListener(system.actorOf(Props[RebalanceListener]))

    val control = Consumer.committableSource(consumerSettings, subscription) ...

31 Transactional Exactly-Once

32 Kafka Transactions
Transactions enable atomic writes to multiple Kafka topics and partitions. All of the messages included in the transaction will be successfully written or none of them will be.

33 Message Delivery Semantics
At most once
At least once
Exactly once

34 Exactly Once Delivery vs Exactly Once Processing
Exactly-once message delivery is impossible between two parties where failures of communication are possible. (Two Generals/Byzantine Generals problem)

35 Why use Transactions?
1. Zero tolerance for duplicate messages
2. Less boilerplate (deduping, client offset management)

36 Anatomy of Kafka Transactions
[Diagram: a consume, transform, produce client reads from a subscribed topic and writes to a destination topic in the cluster. Topic partitions contain UM entries and CM (control message) entries. The cluster hosts the consumer group coordinator with its consumer offset log and the transaction coordinator with its transaction log.]
Important client config
Topic subscription: Subscriptions.topics("Topic1", "Topic2", "Topic3")
Destination topic partitions get included in the transaction based on messages that are produced.
Kafka consumer properties:
  group.id: my-group
  isolation.level: read_committed
  plus other relevant consumer group configuration
Kafka producer properties:
  transactional.id: my-transaction
  enable.idempotence: true (implicit)
  max.in.flight.requests.per.connection: 1 (implicit)

37 Kafka Features That Enable Transactions
1. Idempotent producer
2. Multiple partition atomic writes
3. Consumer read isolation level

38 Idempotent Producer (1/5)
Client calls KafkaProducer.send(k,v) with sequence num = 0 and producer id = 123.
[Diagram: client, and a broker in the cluster holding the leader partition log]

39 Idempotent Producer (2/5)
Broker appends (k,v) to the partition with sequence num = 0 and producer id = 123.
Leader partition log: (k,v) seq = 0, pid = 123

40 Idempotent Producer (3/5)
Broker acknowledgement fails (x).
Leader partition log: (k,v) seq = 0, pid = 123

41 Idempotent Producer (4/5)
Client retry: KafkaProducer.send(k,v) with sequence num = 0 and producer id = 123.
Leader partition log: (k,v) seq = 0, pid = 123

42 Idempotent Producer (5/5)
Broker acknowledgement succeeds: ack(duplicate).
Leader partition log: (k,v) seq = 0, pid = 123
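
The five steps above can be modelled with a small in-memory sketch. It is a deliberately simplified illustration, not the broker's actual implementation: `PartitionLeader`, `Record` and the `Ack` values are hypothetical names, and the model only remembers the last sequence number per producer id (a real broker tracks more, such as producer epochs and out-of-order sequences).

```scala
import scala.collection.mutable

object IdempotentProducerSketch {
  final case class Record(key: String, value: String, producerId: Long, sequence: Int)

  sealed trait Ack
  case object Appended extends Ack
  case object Duplicate extends Ack

  // Hypothetical, simplified partition leader.
  final class PartitionLeader {
    private val log = mutable.ArrayBuffer.empty[Record]
    private val lastSeq = mutable.Map.empty[Long, Int]

    // A retry carries the same producer id and sequence number as the original send,
    // so it is acknowledged as a duplicate instead of being appended a second time.
    def append(r: Record): Ack =
      if (lastSeq.get(r.producerId).exists(_ >= r.sequence)) Duplicate
      else {
        log += r
        lastSeq(r.producerId) = r.sequence
        Appended
      }

    def entries: List[Record] = log.toList
  }
}
```

Sending the same record twice (the original send whose acknowledgement was lost, then the retry) leaves exactly one entry in the log.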

43 Multiple Partition Atomic Writes
Ex) Second phase of two phase commit
Client calls KafkaProducer.commitTransaction().
[Diagram: transaction and consumer group coordinators in the cluster. Transactions log: CM, transaction committed (internal). Consumer offset log: CM, last offset processed for consumer subscription. User defined partitions 1, 2 and 3: UM entries followed by CM, transaction committed control messages (user topics).]
Multiple partitions committed atomically, all or nothing.

44 Consumer Read Isolation Level
[Diagram: a client reads user defined partitions 1, 2 and 3 in the cluster, which contain UM and CM entries]
Kafka consumer properties:
  isolation.level: read_committed
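
As a rough illustration of what isolation.level: read_committed means for a consumer, the sketch below models a single partition log in plain Scala. The types `Data`, `Commit` and `Abort` and both functions are hypothetical; the model assumes that a read_committed consumer only sees messages before the first still-open transaction and skips messages of aborted transactions.

```scala
object ReadCommittedSketch {
  sealed trait Entry
  final case class Data(txn: String, value: String) extends Entry // user message written inside a transaction
  final case class Commit(txn: String) extends Entry              // control message: transaction committed
  final case class Abort(txn: String) extends Entry               // control message: transaction aborted

  // read_uncommitted: every user message is visible.
  def readUncommitted(log: Vector[Entry]): Vector[String] =
    log.collect { case Data(_, v) => v }

  // read_committed: stop at the first message of a still-open transaction
  // and skip messages that belong to aborted transactions.
  def readCommitted(log: Vector[Entry]): Vector[String] = {
    val committed = log.collect { case Commit(t) => t }.toSet
    val aborted = log.collect { case Abort(t) => t }.toSet
    val firstOpen = log.indexWhere {
      case Data(t, _) => !committed(t) && !aborted(t)
      case _          => false
    }
    val visible = if (firstOpen == -1) log else log.take(firstOpen)
    visible.collect { case Data(t, v) if committed(t) => v }
  }
}
```

In this model a committed message that sits behind an open transaction stays invisible until that earlier transaction is committed or aborted.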

45 Transactional Pipeline Latency
[Diagram: three clients in a pipeline]
Transaction batches every 100ms
End-to-end latency ~300ms

46 Alpakka Kafka Transactions
[Diagram: Source Kafka Partition(s) in the cluster -> Transactional Source -> Transform -> Transactional Sink -> Destination Kafka Partitions in the cluster]
Messages waiting for ack before commit.
Example messages shown in the partitions:
  Key: EN, Value: { "message": "Hi Akka!" }
  Key: FR, Value: { "message": "Salut Akka!" }
  Key: ES, Value: { "message": "Hola Akka!" }
  Key: EN, Value: { "message": "Bye Akka!" }
  Key: FR, Value: { "message": "Au revoir Akka!" }
  Key: ES, Value: { "message": "Adiós Akka!" }
akka.kafka.producer.eos-commit-interval = 100ms

47 Transactional GraphStage (1/7)
[Diagram: the Transactional GraphStage with a transaction flow, a transaction status, a back pressure status, a "waiting for ACK" area, a commit loop and a mailbox]
Transaction status: Begin Transaction
Back pressure status: Resume Demand
Commit loop: waiting

48 Transactional GraphStage (2/7)
Transaction status: Transaction is Open
Back pressure status: Resume Demand
Messages flowing; commit loop waits until the commit interval elapses.

49 Transactional GraphStage (3/7)
Transaction status: Transaction is Open
Back pressure status: Resume Demand
Messages flowing; the commit interval (100ms) elapses and a commit loop tick message arrives in the mailbox.

50 Transactional GraphStage (4/7)
Transaction status: Transaction is Open
Back pressure status: Suspend Demand
Messages stopped.

51 Transactional GraphStage (5/7)
Transaction status: Send Consumed Offsets
Back pressure status: Suspend Demand
Messages stopped.

52 Transactional GraphStage (6/7)
Transaction status: Commit Transaction
Back pressure status: Suspend Demand
Messages stopped.

53 Transactional GraphStage (7/7)
Transaction status: Begin New Transaction
Back pressure status: Resume Demand
Messages flowing again.
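
The seven slides describe a repeating cycle, which can be summarised as a tiny state machine. This is a sketch of the cycle as shown on the slides, not the Alpakka Kafka source code; the `Step` names and both functions are hypothetical.

```scala
object TransactionalStageSketch {
  sealed trait Step
  case object BeginTransaction extends Step    // new transaction started, demand resumed
  case object TransactionOpen extends Step     // messages flowing
  case object DemandSuspended extends Step     // commit interval elapsed, messages stopped
  case object SendConsumedOffsets extends Step // consumed offsets added to the transaction
  case object CommitTransaction extends Step   // transaction committed

  // One pass through the commit loop, ending where it started.
  def next(step: Step): Step = step match {
    case BeginTransaction    => TransactionOpen
    case TransactionOpen     => DemandSuspended
    case DemandSuspended     => SendConsumedOffsets
    case SendConsumedOffsets => CommitTransaction
    case CommitTransaction   => BeginTransaction
  }

  // Upstream demand is only signalled while a transaction is accepting messages.
  def demandAllowed(step: Step): Boolean = step match {
    case BeginTransaction | TransactionOpen => true
    case _                                  => false
  }
}
```

Back-pressure is what makes the commit safe in this picture: no new messages enter the stage between suspending demand and beginning the next transaction.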

54 Alpakka Kafka Transactions
    val producerSettings =
      ProducerSettings(system, new StringSerializer, new ByteArraySerializer)
        .withBootstrapServers("localhost:9092")
        // Optionally provide a Transaction commit interval (default is 100ms)
        .withEosCommitInterval(100.millis)

    val control =
      // Use Transactional.source to propagate necessary info to Transactional.sink (CG ID, Offsets)
      Transactional.source(consumerSettings, Subscriptions.topics("source-topic"))
        .via(transform)
        .map { msg =>
          ProducerMessage.Message(
            new ProducerRecord[String, Array[Byte]]("sink-topic", msg.record.value),
            msg.partitionOffset)
        }
        // Call Transactional.sink or .flow to produce and commit messages
        .to(Transactional.sink(producerSettings, "transactional-id"))
        .run()

55 Complex Event Processing

56 What is Complex Event Processing (CEP)?
Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. (Foundations of Complex Event Processing, Cornell)

57 Calling into an Akka Actor System
[Diagram: Source -> Ask (?) -> Sink between Kafka clusters. The Ask stage calls a cluster router actor in an Akka Cluster/Actor System made up of several actor systems and JVMs.]
Ask pattern models non-blocking request and response of Akka messages.

58 Actor System Integration
    // Transform your stream by processing messages in an Actor System. All you need is an ActorRef.
    class ProblemSolverRouter extends Actor {
      def receive = {
        case problem: Problem =>
          val solution = businessLogic(problem)
          sender() ! solution // reply to the ask
      }
    }
    ...
    val control =
      Consumer
        .committableSource(consumerSettings, Subscriptions.topics("topic1", "topic2"))
        .map(parseProblem)
        // Use Ask pattern (? function) to call provided ActorRef to get an async response.
        // Parallelism limits how many messages are in flight, so we don't overwhelm the
        // mailbox of the destination Actor and we maintain stream back-pressure.
        .mapAsync(parallelism = 5)(problem => (problemSolverRouter ? problem).mapTo[Solution])
        .map { solution =>
          ProducerMessage.Message[String, Array[Byte], ConsumerMessage.CommittableOffset](
            new ProducerRecord("targettopic", solution.toBytes),
            solution.committableOffset)
        }
        .toMat(Producer.commitableSink(producerSettings))(Keep.both)
        .mapMaterializedValue(DrainingControl.apply)
        .run()

59 Persistent Stateful Stages

60 Options for implementing Stateful Streams
1. Provided Akka Streams stages: fold, scan, etc.
2. Custom GraphStage
3. Call into an Akka Actor System

61 Persistent Stateful Stages using Event Sourcing
1. Recover state after failure
2. Create an event log
3. Share state

62 Persistent GraphStage using Event Sourcing
[Diagram: Source (cluster) -> Stateful Stage -> Sink (cluster). Inside the stateful stage, a request handler takes a request (command/query) and an event handler emits a response (event) that triggers a state change. Events are written to, and read (replayed) from, an event log through Akka Persistence plugins (akka.persistence.journal).]

63 krasserm / akka-stream-eventsourcing
"This project brings to Akka Streams what Akka Persistence brings to Akka Actors: persistence via event sourcing."
Experimental

64 New in Alpakka Kafka 1.0-M1

65 Alpakka Kafka 1.0-M1 Release Notes
Released Nov 6,
Highlights:
  Upgraded the Kafka client to version #544
  Support new APIs from KIP-299: Fix Consumer indefinite blocking behaviour in #614
  New Committer.sink for standardised committing #622
  Commit with metadata #563 and #579
  Factored out akka.kafka.testkit for internal and external use: see Testing
  Support for merging commit batches #584
  Reduced risk of message loss for partitioned sources #589
  Expose Kafka errors to stream #617
  Java APIs for all settings classes #616
  Much more comprehensive tests

66 Conclusion

67 Alpakka Kafka connector

68 Lightbend Fast Data Platform

69 Free ebook! Thank You! Sean in/seanaglover
