Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Size: px

Start display at page:

Download "Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch"

Valentine Brooks
6 years ago
Views:

1 Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

2 Principal Engineer, IBM Apache Spark PMC Focused on machine learning Author of Machine Learning with Spark

3 Agenda Recommender systems & the machine learning workflow Data modelling for recommender systems Why Spark, Kafka & Elasticsearch? Kafka & Spark Streaming Spark ML for collaborative filtering Deploying & scoring recommender models with Elasticsearch Monitoring, feedback & re-training Scaling model serving Demo

4 Recommender Systems & the ML Workflow

5 Recommender Systems Overview

6 The Machine Perception Learning Workflow Data??? Machine Learning??? $$$

The Machine Reality Learning Workflow Spark

7 The Machine Reality Learning Workflow Spark ML Missing piece Various Spark DataFrames??? Data Ingest Data Processing Model Training Deploy Live System Historical Streaming Feature transformation & engineering Model selection & evaluation Pipelines, not just models Versioning Predict given new data Monitoring & live evaluation Feedback Loop Stream (Kafka)

Metadata Events Aggregation Handle implicit data ALS Ranking-style evaluation Model size &

8 The Machine Recommender Version Learning Workflow Spark ML Elasticsearch Spark DataFrames Elasticsearch Data Ingest Data Processing Model Training Deploy Live System User & Item Metadata Events Aggregation Handle implicit data ALS Ranking-style evaluation Model size & complexity User & item recommendations Monitoring, filters Feedback => another Event Type Stream (Kafka)

9 Data Modeling for Recommender Systems

10 User and Item Data model Metadata

11 User and Item System Requirements Metadata Filtering & Grouping Business Rules

Anatomy of a User Interactions User Event

12 Anatomy of a User Interactions User Event User interactions Implicit preference data Intent data Page view ecommerce - cart, purchase Media preview, watch, listen Search query Explicit preference data Social network interactions Rating Review Like Share Follow

13 Anatomy of a Data model User Event

14 Anatomy of a How to handle implicit feedback? User Event

15 Why Kafka, Spark & Elasticsearch?

16 Why Kafka? Scalability De facto standard for a centralized enterprise message / event queue Integration Integrates with just about every storage & processing system Good Spark Streaming integration 1 st class citizen Including for Structured Streaming (but still very new & rough)

17 Why Spark? DataFrames Events & metadata are lightly structured data Suited to DataFrames Pluggable external data source support Spark ML Spark ML pipelines including scalable ALS model for collaborative filtering Implicit feedback & NMF in ALS Cross-validation Custom transformers & algorithms

18 Why Elasticsearch? Storage Native JSON Scalable Good support for time-series / event data Kibana for data visualisation Integration with Spark DataFrames Scoring Full-text search Filtering Aggregations (grouping) Search ~== recommendation (more later)

19 Kafka for Recommender Systems

20 Event Data Pipeline User analytics & aggregation Kafka Spark Streaming Event store Dashboards Item analytics & aggregation

21 Write to Event Store Event store Spark Streaming

22 Kibana Dashboards Spark Streaming Dashboards

23 Item Metadata Analytics Spark Streaming Aggregated activity metrics Item analytics & aggregation

24 User Metadata Analytics User analytics & aggregation Spark Streaming Aggregated activity metrics & item exclusions

25 Structured Status Streaming Still early days Initial Kafka support in Spark No ES support yet not clear if it will be a full-blown datasource or ForeachWriter For now, you can create a custom ForeachWriter for your needs

26 Spark ML for Collaborative Filtering

27 Collaborative Matrix Factorization Filtering

Collaborative Prediction Filtering 3 1.1 3.2 4.3 1 0.2 1.4 3.1 0.2 1.7 2.3 0.

28 Collaborative Prediction Filtering

29 Collaborative Loading Data in Spark ML Filtering

30 Alternating Least Implicit Preference Data Squares

31 Deploying & Scoring Recommendation Models

32 Prelude: Search Analysis cat videos Full-text Search & Similarity Scoring Term vectors cat 0 1 videos 1 Ranking 0 Similarity Sort results

2 User (or item) vector Dot product & cosine similarity the

33 Recommendation Can we use the same machinery?? Analysis Term vectors Scoring Ranking User (or item) vector Dot product & cosine similarity the same as we need for recommendations 0.3 Similarity Sort results

34 Elasticsearch Delimited Payload Filter Term Vectors Raw vector Custom analyzer Term vector with payloads

Elasticsearch Custom scoring function Scoring Native script (Java), compiled for speed Scoring function computes dot product by: For each document

35 Elasticsearch Custom scoring function Scoring Native script (Java), compiled for speed Scoring function computes dot product by: For each document vector index ( term ), retrieve payload score += payload * query(i) Normalizes with query vector norm and document vector norm for cosine similarity

36 Recommendation Can we use the same machinery? Analysis Term vectors Scoring Ranking User (or item) vector Delimited payload filter 0.3 Custom scoring function Sort results

37 Elasticsearch We get search engine functionality for free Scoring

38 Alternating Least Deploying to Elasticsearch Squares

39 Monitoring & Feedback

40 System Events Logging Recommendations Served

41 System Events Logging Recommendation Actions

42 Tracking Performance Performance monitoring & alerts Event store Kafka Spark Streaming Dashboards Impression capping / fatigue

43 Scaling Model Scoring

44 Scoring Performance k=20 k=50 k=100 Scoring time per query, by factor dimension & number of items Time (ms) ,000 1,000,000 Size of item set *3x nodes, 30x shards

45 Scoring Increasing number of shards Performance Time (ms) Scoring time per query, by number of shards & number of items 10 shards 30 shards 60 shards 90 shards 100,000 1,000,000 Size of item set *3x nodes, k=50

46 Scoring Locality Sensitive Hashing Performance LSH hashes each input vector into L hash tables. Each table contains a hash signature created by applying k hash functions. Standard for cosine similarity is Sign Random Projections At indexing time, create a bucket by combining hash table id and hash signature Store buckets as part of item model metadata At scoring time, filter candidate set using term filter on buckets of query item Tune LSH parameters to trade off speed / accuracy LSH coming soon to Spark ML SPARK-5992

47 Scoring Locality Sensitive Hashing Performance 250 Scoring time per query - brute force vs LSH 200 Time (ms) Brute force LSH *3x nodes, 30x shards, k=50, 1,000,000 items

48 Scoring Comparison to score then search Performance 250 Score Sort Search Scoring time per query LSH vs score-then-search 200 Time (ms) Brute force LSH Score-then-search *3x nodes, 30x shards, k=50, 1,000,000 items

49 Demo

50 Future Work

51 Future Work Apache Solr version of scoring plugin (any takers?) Investigate ways to improve Elasticsearch scoring performance Performance for LSH-filtered scoring should be better Can we dig deep into ES scoring internals to combine efficiency of matrix-vector math with ES search & filter capabilities? Investigate more complex models Factorization machines & other contextual recommender models Scoring performance Spark Structured Streaming with Kafka, Elasticsearch & Kibana Continuous recommender application including data, model training, analytics & monitoring

52 References Elasticsearch Elasticsearch Spark Integration Spark ML ALS for Collaborative Filtering Collaborative Filtering for Implicit Feedback Datasets Factorization Machines Elasticsearch Term Vectors & Payloads Delimited Payload Filter Vector Scoring Plugin Kafka & Spark Streaming Kibana

53 Thanks

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused