ML pipelines at OK. Dmitry Bugaychenko

Size: px

Start display at page:

Download "ML pipelines at OK. Dmitry Bugaychenko"

Coral Banks
6 years ago
Views:

1 ML pipelines at OK Dmitry Bugaychenko

3 OK is monthly unique users

4 OK is family links in the social graph

5 OK is A place where people share their posi9ve feelings

OK is 10000+ servers around the globe 1+Tb/s of outgoing

6 OK is servers around the globe 1+Tb/s of outgoing traffic 400+ sodware components High Load, Big Data, Fault Tolerance

7 OK is 3 Hadoop clusters 30+ petabytes storage (+20Tb daily) cores 40+ TB RAM 300+ regular jobs

8 Data Mining in 3 Steps Get the data

9 Data Mining in 3 Steps Get the data Find unobvious dependencies

10 Data Mining in 3 Steps Get the data Find unobvious dependencies Use them to make decisions

11 Data Mining with Machine Learning Get the data Find unobvious dependencies By fi"ng parameters of a mathema9cal model Use them to make decisions

12 ML Tasks at OK Find the good stuff Recommender systems Find the bad stuff An9-spam systems Understand the users Explora9ve analy9cs

13 Smart Data Toolbox at OK

14 Spark ML Pipelines

15 Spark ML Pipelines Transformer

16 Spark ML Pipelines Transformer Estimator

17 Spark ML Pipelines

18 Spark ML Pipelines

19 Spark ML Pipelines

20 Meet the OK ML Pipelines Open source project: hyps://github.com/odnoklassniki/ok-ml-pipelines Extends Spark ML pipelines for Flexible dataflow organiza9on BeYer cluster u9liza9on ML scaling and opera9onalizing

21 Example task Churn predic9on Input: 10M users with 200+ ayributes Output: user s reten9on Method: Find unobvious dependencies Build a predic9on model Analyze the models weights Use them to make decisions Profit!

22 The first pipeline: Extract features

23 The first pipeline: Convert to vector

24 The first pipeline: Choose the ML method

25 Logistic Regression at a glance 1 1+ e ( a x+b )

26 Logistic Regression at a glance 1 1+ e ( a x+b )

27 Logistic Regression at a glance 1 1+ e ( a x+b )

28 The first pipeline: Train the model

29 The first pipeline: Get the predictions

30 The first pipeline: Extract Features

31 The first pipeline: Convert to vector

32 The first pipeline: Train the model

33 The most known histogram in ML world

34 Caching example: the simple way

35 Caching example: the simple way failure

36 OK ML Pipelines: Unwrapped Stage

37 OK ML Pipelines: Unwrapped Stage

38 More cases of this kind Cache Repar99oning Random sampling Ordered cut Persist data to temp folder

39 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights

40 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights

41 Back to the goal: exploring weights

42 But wait! Spark ML does NOT work this way!

43 Back to the goal: exploring weights We do not need the model itself We need some insights about the churn Lets look at the weights

44 OK ML Pipelines: Model Summary Collec9on of named DataFrame s OK ML es9mators produces summary Saved as parquet files

45 OK ML Pipelines: Model Summary Collec9on of named DataFrame s OK ML es9mators produces summary Saved as parquet files Spark ML es9mator might be wrapped to produce summary

46 Back to the goal: how good is the model? Does the model make sense? We need to cross-validate the quality

47 Cross-validation

48 OK ML Pipelines: Evaluation

49 OK ML Pipelines: Evaluation

50 OK ML Pipelines: Evaluation

51 OK ML Pipelines: Evaluation

52 Evaluation: Spark ML vs. OK ML Spark ML Evaluator, producing a single scalar Used for hyper parameters tuning OK ML Pipelines Evaluator is a transformer conver9ng predic9ons to metrics Injected using unwrapped stage Metrics block is added to the model summary

53 OK ML Pipelines: Evaluation Binary classifica9on evaluator ROC, RP-curve, F1-plot Par99oned ranking evaluator NDCG, AUC, Post processing evaluator collects stat for metrics Train/test evaluator Cross-valida9on

54 Some thoughts on cross-validation 10+1 models No shared mutable state 90% intersec9on on input

55 Some thoughts on cross-validation 10+1 models No shared mutable state 90% intersec9on on input Single trainer u9lizes less than 100 cores cores in the cluster

56 Spark ML CrossValidator Sequen9al Cache each fold separately

57 Spark ML CrossValidator Sequen9al Cache each fold separately

58 OK ML CrossValidator Cache shared source data Train folds in parallel

59 OK ML Pipelines: Forked Estimator

60 OK ML Pipelines: Forked Estimator Model segmenta9on Mul9-class classifica9on Feature selec9on Cross valida9on Grid search work in progress

61 Cherry on the pie: Features scaling

62 Spark ML Pipelines: Scaling Two models to move to prod Two set of weights to keep

63 Spark ML Pipelines: Scaling Two models to move to prod Two set of weights to keep

64 Spark ML Pipelines: Scaling x i x i sd(x i )

65 Spark ML Pipelines: Scaling x i x i sd(x i ) 1 ( ) 1+ e a x+b

66 OK ML Pipelines: inline scaling aʹ = i a i sd(x i ) ;b' = b ʹ a x

67 OK ML Pipelines: Model Transformer

68 OK ML Pipelines: Model Transformer

69 OK ML Pipelines: Model Transformer

70 OK ML Pipelines: inline scaling

71 Even more cool stuff NLP transformers: Language detec9on Language aware analyzer URL eliminator Vectorizer N-gram extractor Trending term detec9on LDA work in progress Stat u9ls Vector stat collector Extended online summarizer Learning algorithms Matrix LBFGS DSRVGD CRR

72 How to use it Spark 2.2 (sbt): librarydependencies += "ru.odnoklassniki" %% "ok-ml-pipelines" % "0.1-spark2.2" withsources() Spark 1.6 (sbt): librarydependencies += "ru.odnoklassniki" %% "ok-ml-pipelines" % "0.1-spark1.6" withsources() Sources: hyps://github.com/odnoklassniki/ok-ml-pipelines

73 Thank you for your attention!

From click to predict and back: ML pipelines at OK. Dmitry Bugaychenko

From click to predict and back: ML pipelines at OK Dmitry Bugaychenko OK is 70 000 000+ monthly unique users OK is 800 000 000+ family links in the social graph OK is A place where people share their posi9ve