Introduction to Apache Kafka. Jun Rao, Co-founder of Confluent

1 Introduction to Apache Kafka. Jun Rao, Co-founder of Confluent

2 Agenda: Why people use Kafka; Technical overview of Kafka; What's coming

3 What's Apache Kafka? Distributed, high-throughput pub/sub system

4 Kafka Usage

5 Common Pattern in Data-driven Companies [Figure labels: User, Product, Data, Science, Value, Virality, Insights, Signals]

6 Getting Data Is Hard. Data scientists spend 70% of their time getting and cleaning data. Why?

7 Variety in Data Sources. Database records: users, products, orders. Business metrics: clicks, impressions, pageviews. Operational metrics: CPU usage, requests/sec. Application logs: service calls, errors. IoT.

8 Variety in New Specialized Systems. Batch oriented: Hadoop ecosystem (Pig, Hive, Spark, ...). Real time: key/value store (Cassandra, MongoDB, HBase, ...); search (Elasticsearch, Solr, ...); stream processing (Storm, Spark Streaming, Samza); graph (GraphLab, FlockDB, ...); time series DB (OpenTSDB, ...).

9 Danger of Point-to-point Pipelines [Figure: many systems wired directly to one another: Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec. Engine, Search, Security, ..., Production Services]

10 Ideal Architecture: Stream Data Platform [Figure: Espresso, Voldemort, Oracle, User Tracking, Operational Logs, Operational Metrics, Hadoop, Log Search, Monitoring, Data Warehouse, Social Graph, Rec Engine, Search, Security, ... and Production Services, all connected through a central Log] Kafka is the center of the stream data platform!

11 Kafka at LinkedIn. 800 billion messages, 175 TB of data per day. 13 million messages written/sec. 65 million messages read/sec. Tens of thousands of producers. Thousands of consumers.

12 Agenda: Why people use Kafka; Technical overview of Kafka (throughput and scalability; real time and batch consumption; durability and availability); What's coming

13 Concepts and API
Topic defines a message queue.
Producer (Java):
    byte[] key = "key".getBytes();
    byte[] value = "value".getBytes();
    ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("my-topic", key, value);
    producer.send(record);
Consumer (Java):
    // simplified, as on the slide
    streams[] = Consumer.createMessageStreams("topic1", 1);
    for (message : streams[0]) {
        // do something with message
    }

14 Distributed Architecture [Figure: many producers write to a Kafka cluster; many consumers read from it] Topics are partitioned for parallelism
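One way to picture partitioning is a function from a message key to a partition number, so that all messages with the same key land in the same partition. The sketch below is an illustration only: `partitionFor` is a hypothetical helper, and `Arrays.hashCode` stands in for whatever hash a real Kafka partitioner uses.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PartitionSketch {
    // Hypothetical helper: maps a key to one of numPartitions partitions.
    // Arrays.hashCode is a stand-in hash chosen for this sketch.
    static int partitionFor(byte[] key, int numPartitions) {
        // Math.floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    public static void main(String[] args) {
        for (String k : new String[] {"user-1", "user-2", "user-3"}) {
            byte[] key = k.getBytes(StandardCharsets.UTF_8);
            System.out.println(k + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

Because the mapping depends only on the key and the partition count, producers can choose a partition without coordinating with each other.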

15 Built-in Cluster Management. Automated failure detection and failover. Leverage Apache ZooKeeper. Online data movement.

16 Simple Efficient Storage with a Log [Figure: a data source writes to a log (offsets 0 to 10 shown); Destination System A (time = 7) and Destination System B (time = 11) each read from it]
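The log idea can be sketched in a few lines: writers only append, and each reader remembers its own position. This is an in-memory toy, not Kafka's storage code; the class name `LogSketch` and the record strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LogSketch {
    private final List<String> entries = new ArrayList<>();

    // Appends a record and returns its offset (its index in the log).
    long append(String record) {
        entries.add(record);
        return entries.size() - 1;
    }

    // Returns up to maxRecords records starting at the given offset.
    List<String> read(long offset, int maxRecords) {
        int from = (int) Math.max(0, Math.min(offset, entries.size()));
        int to = Math.min(from + maxRecords, entries.size());
        return new ArrayList<>(entries.subList(from, to));
    }

    public static void main(String[] args) {
        LogSketch log = new LogSketch();
        for (int i = 0; i <= 10; i++) {
            log.append("record-" + i);
        }
        // Each destination keeps its own position; the log does not track readers.
        long positionA = 7;
        long positionB = 11;
        System.out.println("A reads " + log.read(positionA, 100));
        System.out.println("B reads " + log.read(positionB, 100));
    }
}
```

Here destination A still has four records to read, while destination B is caught up and gets an empty list until more records are appended.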

17 Batching and Compression. Batching: producer, broker, consumer. Compression: Gzip, Snappy, LZ4; end-to-end.
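To see why batching and compression go together, the sketch below compares gzip-compressing messages one at a time against compressing them as a single batch. It uses only the JDK's `java.util.zip`; the sample messages are invented, and real payloads will compress differently.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

class BatchCompressionSketch {
    // Returns the gzip-compressed size of data, in bytes.
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.size();
    }

    // Made-up pageview events used only for this illustration.
    static List<byte[]> sampleMessages(int n) {
        List<byte[]> messages = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String json = "{\"user\":\"u" + i + "\",\"event\":\"pageview\",\"page\":\"/home\"}";
            messages.add(json.getBytes(StandardCharsets.UTF_8));
        }
        return messages;
    }

    // Total size when every message is compressed on its own.
    static int sizeCompressedOneByOne(List<byte[]> messages) {
        int total = 0;
        for (byte[] m : messages) {
            total += gzipSize(m);
        }
        return total;
    }

    // Size when all messages are concatenated and compressed together.
    static int sizeCompressedAsBatch(List<byte[]> messages) {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] m : messages) {
            batch.write(m, 0, m.length);
        }
        return gzipSize(batch.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> messages = sampleMessages(100);
        System.out.println("one by one: " + sizeCompressedOneByOne(messages) + " bytes");
        System.out.println("as a batch: " + sizeCompressedAsBatch(messages) + " bytes");
    }
}
```

Similar messages share structure, so a compressor sees far more redundancy across a batch than within a single small message.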

18 Real Time and Batch Consumption. Multi-subscription model. Data persisted on disk, NOT cached in JVM; rely on pagecache. Zero-copy transfer (broker -> consumer). Ordered consumption per topic partition; significantly less bookkeeping.
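In Java, zero-copy transfer is usually expressed with `FileChannel.transferTo`, which hands a byte range of a file to a target channel without copying it through an application buffer when the platform supports that. The sketch below shows the call shape only: the segment file is a temp file with made-up contents, and the target is an in-memory channel rather than a consumer's socket, so no real zero-copy path is exercised here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopySketch {
    // Sends bytes [position, position + count) of a segment file to target.
    static long send(Path segment, long position, long count, WritableByteChannel target) {
        try (FileChannel file = FileChannel.open(segment, StandardOpenOption.READ)) {
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (sent < count) {
                long n = file.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // nothing more to transfer (for example, end of file)
                }
                sent += n;
            }
            return sent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny stand-in "segment" and sends its second half to a sink.
    static String demo() {
        try {
            Path segment = Files.createTempFile("segment", ".log");
            Files.write(segment, "m0m1m2m3".getBytes(StandardCharsets.UTF_8));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            send(segment, 4, 4, Channels.newChannel(sink));
            Files.delete(segment);
            return new String(sink.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the broker serves stored bytes as they are, data that is already in the pagecache can go to the consumer without passing through the JVM heap.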

19 Durability and Availability. Built-in replication. Configurable replication factor. Tolerating f - 1 failures with f replicas. Automated failover.

20 Replicas and Layout. Topic partition has replicas. Replicas spread evenly among brokers.
Logs on each broker:
broker 1       broker 2       broker 3       broker 4
topic1-part1   topic1-part2   topic2-part1   topic2-part2
topic2-part2   topic1-part1   topic1-part2   topic2-part1
topic2-part1   topic2-part2   topic1-part1   topic1-part2
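An even spread like this can be produced by a simple round-robin rule: put the first replica of partition p on broker p mod n and the following replicas on the next brokers. The sketch below is a simplified illustration with 0-based broker ids; `assign` is a hypothetical helper, and Kafka's real assignment logic has more to it.

```java
import java.util.Arrays;

class ReplicaLayoutSketch {
    // assignment[p][r] is the 0-based broker id hosting replica r of partition p.
    static int[][] assign(int numPartitions, int replicationFactor, int numBrokers) {
        if (replicationFactor > numBrokers) {
            throw new IllegalArgumentException("replication factor exceeds broker count");
        }
        int[][] assignment = new int[numPartitions][replicationFactor];
        for (int p = 0; p < numPartitions; p++) {
            for (int r = 0; r < replicationFactor; r++) {
                // Consecutive replicas go to consecutive brokers, wrapping around.
                assignment[p][r] = (p + r) % numBrokers;
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Four partitions, three replicas each, four brokers.
        int[][] assignment = assign(4, 3, 4);
        for (int p = 0; p < assignment.length; p++) {
            System.out.println("partition " + p + " -> brokers " + Arrays.toString(assignment[p]));
        }
    }
}
```

With four partitions, three replicas and four brokers, every broker ends up with three replicas, and if replica 0 is treated as the leader, each broker leads exactly one partition.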

21 Data Flow in Replication [Figure: topic1-part1 has a leader on broker 1 and followers on brokers 2 and 3. Step 1: producer sends to the leader. Step 2: the two followers replicate. Step 3: commit and ack to the producer. Step 4: consumer reads.]
When producer receives ack | Latency              | Durability on failures
no ack                     | no network delay     | some data loss
wait for leader            | 1 network roundtrip  | a little data loss
wait for committed         | 2 network roundtrips | no data loss
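In the Java producer client, this trade-off is selected with the `acks` setting, which accepts "0", "1" and "all". The sketch below only builds the configuration with `java.util.Properties`, so it runs without a Kafka dependency. The enum names and the `localhost:9092` address are placeholders invented for illustration.

```java
import java.util.Properties;

class AckSketch {
    // Hypothetical names for the three acknowledgement levels.
    enum AckMode {
        NONE("0"),        // no ack: no network delay, some data loss on failure
        LEADER("1"),      // wait for leader: 1 network roundtrip
        COMMITTED("all"); // wait for committed: 2 network roundtrips, no data loss

        final String acks;

        AckMode(String acks) {
            this.acks = acks;
        }
    }

    // Builds producer settings for the chosen acknowledgement level.
    static Properties producerProps(AckMode mode) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("acks", mode.acks);
        return props;
    }

    public static void main(String[] args) {
        for (AckMode mode : AckMode.values()) {
            System.out.println(mode + " -> acks=" + producerProps(mode).getProperty("acks"));
        }
    }
}
```

A `Properties` object like this would typically be passed to the producer's constructor together with serializer settings, which are omitted here.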

22 Extend to Multiple Partitions [Figure: topic1-part1, topic2-part1 and topic3-part1 each have one leader and two followers placed across brokers 1 to 4; each producer writes to a partition's leader] Leaders are evenly spread among brokers

23 Agenda: Why people use Kafka; Technical overview of Kafka; What's coming

24 Stream Data Platform

25 Future Releases of Apache Kafka. New Java consumer client: better performance; easier protocol for non-Java clients. Security: authentication (SSL and Kerberos); authorization. Better cluster management: more automated tools; quotas. Transactional support: exactly-once delivery.

26 Confluent. Mission: make stream data platform a reality. Kafka development and support. Product: metadata management (released); REST endpoint (released); connectors for common systems; monitor data flow end-to-end; stream processing integration.

27 Q&A. More info on Apache Kafka: http://kafka.apache.org/ Confluent: http://confluent.io http://confluent.io/careers Kafka meetup tonight at 6:30 (Texas V)
