Scaling the Yelp s logging pipeline with Apache Kafka. Enrico

Size: px

Start display at page:

Download "Scaling the Yelp s logging pipeline with Apache Kafka. Enrico"

Harvey Page
5 years ago
Views:

1 Scaling the Yelp s logging pipeline with Apache Kafka Enrico Canzonieri

2 Yelp s Mission Connecting people with great local businesses.

3 Yelp Stats As of Q M 102M 70% 32

4 What to expect High-level architecture overview Kafka best-practices No code

5 Logging? What is that?

6 A log is a stream of events

7 Logs provide valuable operational and business visibility.

8 Producing and consuming logs should be easy

9 Logs at scale

10 Consuming logs becomes hard Solution: Centralized logging!

11 Simple aggregation strategy: upload all logs to a centralized datastore Pros: Easy to implement (cron, logrotate, etc) Unique place to access logs Cons: No real time streaming Colocation not aggregation Amazon S3

12 Log aggregation systems

13 Scribe Push based architecture

14 Yelp s logging pipeline V1.0 Yelp Server Web app Log aggregator Scribe Leaf Yelp Server S3 uploader Scribe aggregator xinetd tail -follow=name /nail/scribe/<category>/current Web app Scribe Leaf Amazon S3 consumer consumer

15 Why use Scribe? Simple to configure Multilang support (via Thrift) Stable (most of the times) and fast Support arbitrary message size

16 Why move away from Scribe? No support for log consumption Doesn t really scale No replication Abandoned project Lack of plugins

17 Stream processing: Statmonster

18 Stream processing: Statmonster Statmonster is a real-time metrics pipeline The first stage extract multiple metrics from each line Consumes most of the high volume logs

19 Stream processing: Statmonster In 2013 the metrics from high volume logs started falling behind The first stage of the pipeline was too slow to keep up with the increased volume of logs

20 Stream processing: Statmonster Short term solution: sample logs to save timeliness Long term solution: run Statmonster on Apache Storm.

21 Stream processing: Statmonster Short term solution: sample logs to save timeliness Long term solution: run Statmonster on Apache Storm.

22 The real bottleneck Can t scale the consumer without increasing the number of physical log aggregators Aggregators tail -f /nail/scribe/access tail -f /nail/scribe/access tail -f /nail/scribe/access Statmonster Statmonster Statmonster

23 Need to scale the source!

24 Kafka into the mix High-throughput Publish-subscribe messaging system Distributed, replicated commit log Client library available for a variety of languages Configurable retention

25 Kafka Message (topic, partition, offset) Message Topic Partition Partition Partition Old New

26 Kafka Producer Java producer is async by default Messages are produced in batch Support either at-least-once or atmost-once semantic Support compression out of the box

27 Kafka Consumer Identified by a group id Consumers coordinate to consume from different partitions Ability to re-play messages Native support for offset commit

28 Architecture overview

29 Yelp s logging pipeline V1.5 Amazon S3 Scribe Leaf Sekretar Scribe aggregator Scribe Leaf Kafka consumer Legacy consumer Legacy consumer Kafka consumer

30 Sekretar Act as a Scribe aggregator speaking the Scribe protocol Produce messages to Kafka Map a log category into a topic Act as an intermediate buffer Handle big messages

31 Statmonster 2.0 access partition 0 Statmonster Kafka consumer access partition 1 Statmonster Kafka consumer access partition N Statmonster Kafka consumer

32 Yelp s logging pipeline V2.0 Amazon S3 Scribe Leaf Sekretar Scribe Leaf Legacy consumer Legacy consumer Scribe Kafka Reader Kafka consumer

33 Multi-regional architecture Data center 1 Data center 2 Aggregate logs consumer Sekretar Kafka consumer Kafka consumer Sekretar Amazon S3 Leaf Leaf

34 Configuration

35 Availability vs Consistency Cluster side: Unclean leader election Replication factor Disabled unclean.leader.election.enable 3: 1 leader + 2 followers default.replication.factor Minimum in-sync replicas (ISR) Quorum: 3/2 + 1 = 2 min. insync.replicas Producer side: All (or -1): all replicas in the ISR need to ack acks acks

36 Provisioning partitions How many partitions for a topic? Producer Topic 1 Partition 1 Consumer Producer Partition 2 Consumer Group 1 Partition 3 Consumer Producer Producer Partition 1 Partition 2 Consumer Consumer Group 2 Topic 2

37 Criteria 1: consumer Consumer speed Number of different consumer groups

38 Criteria 2: Kafka Physical Limits of a single broker egress/ingress Log retention Recovery speed upon broker failures

39 Criteria 3: producer Ingress rate into Kafka in msg/s and bytes/s

40 Accounting for spikes Message rate (msg/s) percentiles: 50, 75, 90, 95, 99, 99.5

41 Partitions auto-creation Regular traffic increase Automatically increase the number of partitions based on 99th percentile of bytes/s and msg/s over a time window of 3 days. Use a combined threshold of 700 msg/s and 500 KiB/s.

42 Tooling

43 Kafka Manager

44 Kafka Utils Contains several command line tools to help operating and maintaining a Kafka cluster. Cluster rolling restart Consumer offset management Cluster rebalance and broker decommission Healthchecks

45 Monitoring

46 Monitoring Sekretar Ingress: msg/s, bytes/s Egress: msg/s, (compressed) bytes/s Message delay Producer latency Producer memory buffer Timed-out message count

47 Monitoring Sekretar Unhealthy Kafka cluster

48 Kafka monitoring Kafka exports metrics via jmx ( io/1.0/kafka/monitoring.html): Offline partitions Under replicated partitions Controller count Kafka-Utils includes a check for min.isr

49 Consumer monitoring Consumer speed (msg/s) Kafka speed Consumer lag (msg)

50 Consumer monitoring Consumer slow down or logs volume increase

51 Apache Yelp 18 clusters ~ 100 brokers > 20K topics ~ 45K partitions > 5TB of data per day ~ 25 Billion messages per day

52 engineeringblog.yelp.com github.com/yelp

Intra-cluster Replication for Apache Kafka. Jun Rao

Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture