Processing 11 billions events a day with Spark. Alexander Krasheninnikov

1 Processing 11 billions events a day with Spark Alexander Krasheninnikov

2 Badoo facts: 46 languages; 10M photos added daily; 320M registered users; 190 countries; 21M daily active users; servers; 2 data centers; 350M messages daily

3 What is this talk about? The necessity of measuring a product's metrics; what tools you can use for it; our solution's design for this purpose

4 Why measure?

5 "IF YOU WANT TO IMPROVE SOMETHING, YOU OUGHT TO MEASURE IT FIRST" (one good engineering principle)

6 Product metrics: registrations; UGC; user actions (likes, messages, votes); UX (interaction with the service); etc.

7 Processing specialty

8 Large sliding windows: aggregation of data within 1 hour or 1 day (millions of events)

9 Heterogeneous events: different events have their own sets of attributes and aggregation instructions

10 Different dimensions: as a consequence, aggregation is performed on different grouping fields

11 Reaction needed ASAP: event aggregation results should be available within a reasonable time (every 2-5 minutes)

12 Our first approach

13 StatsCollector: in-house MySQL-based event processing system; each event has its own host and table with an appropriate column set; sharding by time period or column value; aggregation support (increment-only counters)

14 System flaws

18 System flaws: complex sharding based on event rate and target host load; no complex types (flat events with lots of columns); aggregation zoo (each event is analyzed and visualized with its own logic and code); lack of a centralized definition language for events

19 We need to go deeper...

20 Requirements: distributed processing (scalability, reliability, fault tolerance); a single codebase for event aggregation (write less code to collect stats, do more); a formalized definition of events

21 Hadoop: distributed storage (HDFS) with built-in replication; resource management and work control (YARN) with fault tolerance and task restarts; support for programming (Map/Reduce); built-in tools for SQL-like aggregations (Hive)

22 Event delivery to Hadoop [diagram: application hosts (>2k); AGG hosts (2 per DC); aggregation framework (1 instance)]

23 Event delivery to Hadoop: produce the event on an application host; write it into a file on HDD; forward to several agg hosts (logrotate-style aggregation); compress; upload to HDFS

24 Event definition

25 Event Definition Language (EDL): Google Protobuf-based definition; formalized structure and aggregation instructions; code generation for producing events; glossary of all events

26 Event definition example

27 Let's start aggregation!

28 Naive solution - Hive: a popular engine for SQL operations over data in Hadoop; our events' aggregation instructions can be expressed in SQL! Got stuck on a 1-day window with 1B events of a single type (>15 minutes of processing)

29 Complex solution - Spark: a Map/Reduce framework on top of Hadoop; several times faster than Hive on some operations. Key benefits for us: the ability to program aggregation rules; processing of an event stream

30 System overview - Spark (Spark Streaming): every N minutes, find new files in the HDFS directory; for each event, perform aggregations; save the intermediate result into HDFS

31 System overview - agg (Badoo aggregation framework): find the aggregation instructions for the event; extract the fields for aggregation; expand the event (an analog of SQL GROUP BY GROUPING SETS)
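The "expand" step can be illustrated with a small plain-Java sketch. It is not Badoo's framework code: the class name `ExpandSketch`, the key format, and the sample fields are hypothetical, chosen only to show how one event turns into one aggregation key per grouping set, much as GROUP BY GROUPING SETS does in SQL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of expanding one event into several aggregation keys.
class ExpandSketch {
    /**
     * Builds one key per grouping set. Fields that are not part of a
     * grouping set are simply left out of that key.
     */
    static List<String> expand(String eventName,
                               Map<String, String> fields,
                               List<List<String>> groupingSets) {
        List<String> keys = new ArrayList<>();
        for (List<String> set : groupingSets) {
            // TreeMap keeps field order stable, so equal groupings give equal keys.
            Map<String, String> dims = new TreeMap<>();
            for (String field : set) {
                dims.put(field, fields.get(field));
            }
            keys.add(eventName + dims);
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new HashMap<>();
        fields.put("country", "DE");
        fields.put("gender", "F");
        List<List<String>> sets = Arrays.asList(
                Collections.<String>emptyList(),      // grand total
                Arrays.asList("country"),             // per country
                Arrays.asList("country", "gender"));  // per country and gender
        expand("registration", fields, sets).forEach(System.out::println);
    }
}
```

Each emitted key would then be paired with the event's value and reduced by key, which is why a single event contributes to several aggregates at once.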

32 Aggregation overview

33 Event expand (map)

34 Event aggregation (reduce)

35 Code (Java API)
    // API entry point: Spark streaming context
    context = new JavaStreamingContext(
        "yarn-client",          // using YARN cluster
        "Streaming",            // name of the application
        Durations.seconds(30)   // batch size
    );
    // monitor HDFS directory for new files
    lines = context.textFileStream("hdfs://streaming/");
    // parse each line into an event;
    // flatMap transforms 1 input row into [0, N] new elements
    parsedevents = lines.flatMap(new BadooParseFunction());

38 Code (Java API)
    // each event has N cubes of aggregation
    expandedevents = parsedevents.flatMapToPair(new ExpandFunction());
    // aggregate data into 2-minute intervals
    windowsize = Durations.minutes(2);
    windowaggregates = expandedevents.reduceByKeyAndWindow(
        new EventReduceFunction(), windowsize, windowsize);

40 Code (Java API)
    // save 2-minute data as serialized objects
    windowaggregates.foreachRDD((metric, time) -> {
        metric.saveAsObjectFile("hdfs://output/" + formattime(time));
    });
    // start the computation process
    context.start();
    context.awaitTermination();

41 Supported agg functions: COUNT; MIN/MAX/AVG; SUM; COUNT DISTINCT (using HyperLogLog); PERCENTILES (using QDigest)
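Functions such as these fit a reduce step when their partial state can be merged associatively. A minimal sketch for COUNT, SUM, MIN, MAX, and AVG follows; the class name `MetricState` is hypothetical and this is not the talk's `EventReduceFunction`. COUNT DISTINCT and PERCENTILES are left out because they need sketch structures (HyperLogLog, QDigest) from a library.

```java
// Hypothetical mergeable state for COUNT / SUM / MIN / MAX / AVG.
class MetricState {
    final long count;
    final double sum;
    final double min;
    final double max;

    MetricState(long count, double sum, double min, double max) {
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    /** State for a single observed value. */
    static MetricState of(double value) {
        return new MetricState(1, value, value, value);
    }

    /** Combines two partial states; the grouping of merges does not change the result's meaning. */
    MetricState merge(MetricState other) {
        return new MetricState(
                count + other.count,
                sum + other.sum,
                Math.min(min, other.min),
                Math.max(max, other.max));
    }

    /** AVG is derived from SUM and COUNT instead of being merged directly. */
    double avg() {
        return sum / count;
    }

    public static void main(String[] args) {
        MetricState s = MetricState.of(2.0).merge(MetricState.of(4.0)).merge(MetricState.of(9.0));
        System.out.println(s.count + " " + s.sum + " " + s.min + " " + s.max + " " + s.avg());
    }
}
```

Keeping SUM and COUNT rather than the average itself is the design point: averages of averages are wrong when partial results cover different numbers of events.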

42 Large windows aggregation [diagram: 1 hour covered by 2-minute result directories 00-02/, 02-04/, ..., 58-00/]

43 Large windows aggregation: 1) take N of the 2-minute aggregation results from HDFS; 2) perform a reduceByKey operation; 3) save in a format suitable for a time-series DB or reporting (we use JSON); 4) this is the divide-and-conquer principle; 5) serialized agg results consume times less memory than raw data
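The roll-up from 2-minute results to a larger window can be sketched without Spark. The example below assumes each 2-minute result is a simple map from metric key to a counter; the class name `WindowRollup` and the sample keys are hypothetical, and the merge plays the role that reduceByKey plays in the talk.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical roll-up of small-window counters into one large-window result.
class WindowRollup {
    /** Merges per-window counters by key, summing values for equal keys. */
    static Map<String, Long> rollUp(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, value) -> total.merge(key, value, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // 30 two-minute windows make up one hour.
        List<Map<String, Long>> windows = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            Map<String, Long> window = new HashMap<>();
            window.put("likes", 10L);
            windows.add(window);
        }
        System.out.println(rollUp(windows));
    }
}
```

Only the already-aggregated partial results are read for the hourly or daily window, so the raw events never have to be reprocessed.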

44 Maintenance

45 Recovery: restart using the Spark checkpoint mechanism; if the application goes down, it is restarted using a metadata snapshot stored in HDFS; metadata: list of processed files, time of last computation

46 Monitoring: heartbeats, using separate stream processing; the event rate is predefined and near-constant
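Because the heartbeat rate is known in advance, a health check can compare the observed count per window with the expected one. The slide does not say how the check is done; the sketch below is one assumed approach, and the class name `RateMonitor`, the method, and the tolerance parameter are all hypothetical.

```java
// Hypothetical heartbeat check: compare observed and expected counts per window.
class RateMonitor {
    /**
     * Returns true when the observed count is within the given relative
     * tolerance of the expected count (for example 0.1 for 10%).
     */
    static boolean isHealthy(long observed, long expected, double tolerance) {
        if (expected <= 0) {
            throw new IllegalArgumentException("expected must be positive");
        }
        double deviation = Math.abs(observed - expected) / (double) expected;
        return deviation <= tolerance;
    }

    public static void main(String[] args) {
        System.out.println(isHealthy(95, 100, 0.1));
        System.out.println(isHealthy(50, 100, 0.1));
    }
}
```

A drop in the heartbeat count would point at a delivery or processing problem, not at a real change in user behaviour, since heartbeats do not depend on traffic.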

47 Tools

48 Backstage - debug: tail/grep utility (WebSocket servers on agg hosts)

49 Backstage - SQL access: each event type has its own directory in HDFS; a Hive table is defined over each event type; Presto (Facebook's Hadoop SQL engine) is used for interactive querying of events; Hive is used for ETL batch jobs

50 Let's summarize

51 Facts: event stream RPS: >190K/sec; over 350 different event types; 133K metrics per second as the aggregation result; 1 TB of GZIP'ed raw events a day; 100 cores and 200 GB of memory for stream processing needs

52 Summary: stream processing of heterogeneous events is possible! But it needs some coding: map/reduce aggregation differs from SQL. Near-real-time processing can be boosted with the divide-and-conquer principle

53 Links: techblog.badoo.com (our archive of different articles); spark.apache.org (Apache Spark project homepage); github.com/twitter/algebird (Twitter's library with algebra applicable to map/reduce aggregations: HyperLogLog, QDigest)

54 Thank you!
