Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Nick Pentreath, Nov 14, 2016

About @MLnick
- Principal Engineer, IBM
- Apache Spark PMC
- Focused on machine learning
- Author of Machine Learning with Spark

Agenda
- Recommender systems & the machine learning workflow
- Data modelling for recommender systems
- Why Spark, Kafka & Elasticsearch?
- Kafka & Spark Streaming
- Spark ML for collaborative filtering
- Deploying & scoring recommender models with Elasticsearch
- Monitoring, feedback & re-training
- Scaling model serving
- Demo

Recommender Systems & the ML Workflow

Recommender Systems Overview

The Machine Learning Workflow: Perception
Data -> ??? -> Machine Learning -> ??? -> $$$

The Machine Learning Workflow: Reality
- Stages: Data, Ingest, Data Processing, Model Training, Deploy, Live System
- Tools shown: Various, Spark DataFrames, Spark ML, ??? (missing piece)
- Data: historical, streaming
- Feature transformation & engineering
- Model selection & evaluation
- Pipelines, not just models
- Versioning
- Predict given new data
- Monitoring & live evaluation
- Feedback loop: Stream (Kafka)

The Machine Learning Workflow: Recommender Version
- Stages: Data, Ingest, Data Processing, Model Training, Deploy, Live System
- Tools shown: Spark DataFrames, Spark ML, Elasticsearch
- Data: user & item metadata, events
- Aggregation
- Handle implicit data
- ALS
- Ranking-style evaluation
- Model size & complexity
- User & item recommendations
- Monitoring, filters
- Feedback => another event type: Stream (Kafka)

Data Modeling for Recommender Systems

User and Item Metadata: Data model

User and Item Metadata: System Requirements
- Filtering & grouping
- Business rules

Anatomy of a User Event: User Interactions
- Implicit preference data
  - Page view
  - ecommerce: cart, purchase
  - Media: preview, watch, listen
- Intent data
  - Search query
- Explicit preference data
  - Rating
  - Review
- Social network interactions
  - Like
  - Share
  - Follow

Anatomy of a User Event: Data model

Anatomy of a User Event: How to handle implicit feedback?
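One common way to handle implicit feedback is to collapse each user's raw events (page views, cart additions, purchases and so on) into a single weighted interaction strength per (user, item) pair, which can then be passed to an implicit-feedback model. The sketch below is a minimal plain-Python illustration of that idea. The weights, the sample events and the `aggregate_events` helper are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical weight per event type (assumed for illustration only).
EVENT_WEIGHTS = {"view": 1.0, "cart": 3.0, "purchase": 5.0}


def aggregate_events(events, weights=EVENT_WEIGHTS):
    """Sum weighted events into one implicit 'rating' per (user, item)."""
    totals = defaultdict(float)
    for user_id, item_id, event_type in events:
        # Event types without a configured weight contribute nothing.
        totals[(user_id, item_id)] += weights.get(event_type, 0.0)
    return dict(totals)


# Made-up sample events: (user id, item id, event type).
events = [
    ("u1", "i1", "view"),
    ("u1", "i1", "purchase"),
    ("u1", "i2", "view"),
    ("u2", "i1", "cart"),
]
ratings = aggregate_events(events)
```

In a real pipeline the same aggregation would more likely be a group-by over a DataFrame of events; the plain dictionary here only keeps the example self-contained.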

Why Kafka, Spark & Elasticsearch?

Why Kafka?
- Scalability
  - De facto standard for a centralized enterprise message / event queue
- Integration
  - Integrates with just about every storage & processing system
  - Good Spark Streaming integration: first-class citizen
  - Including for Structured Streaming (but still very new & rough)

Why Spark?
- DataFrames
  - Events & metadata are lightly structured data, suited to DataFrames
  - Pluggable external data source support
- Spark ML
  - Spark ML pipelines, including a scalable ALS model for collaborative filtering
  - Implicit feedback & NMF in ALS
  - Cross-validation
  - Custom transformers & algorithms

Why Elasticsearch?
- Storage
  - Native JSON
  - Scalable
  - Good support for time-series / event data
  - Kibana for data visualisation
  - Integration with Spark DataFrames
- Scoring
  - Full-text search
  - Filtering
  - Aggregations (grouping)
  - Search ~== recommendation (more later)

Kafka for Recommender Systems

Event Data Pipeline
[Figure: Kafka -> Spark Streaming -> event store, dashboards, user analytics & aggregation, item analytics & aggregation]

Write to Event Store
[Figure: Spark Streaming -> event store]

Kibana Dashboards
[Figure: Spark Streaming -> dashboards]

Item Metadata Analytics
[Figure: Spark Streaming -> item analytics & aggregation]
- Aggregated activity metrics

User Metadata Analytics
[Figure: Spark Streaming -> user analytics & aggregation]
- Aggregated activity metrics & item exclusions

Structured Streaming: Status
- Still early days
- Initial Kafka support in Spark 2.0.2
- No ES support yet; not clear if it will be a full-blown datasource or a ForeachWriter
- For now, you can create a custom ForeachWriter for your needs

Spark ML for Collaborative Filtering

Collaborative Filtering: Matrix Factorization
[Figure: example ratings matrix and factor matrices; numeric layout not recoverable]

Collaborative Filtering: Prediction
[Figure: same example matrices as the previous slide]
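In a matrix factorization model, the predicted preference of a user for an item is the dot product of the user's factor vector and the item's factor vector, and recommendations are the items with the highest predicted values. The following is a minimal sketch of that prediction step in plain Python. The factor values are made up (they are not the numbers from the slide), and the `predict` and `recommend` helpers are hypothetical names.

```python
def predict(user_factors, item_factors):
    """Predicted preference: dot product of the two factor vectors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))


def recommend(user_factors, items, top_n=2):
    """Score every item for one user and return the ids of the top_n items."""
    scored = [(predict(user_factors, f), item_id) for item_id, f in items.items()]
    scored.sort(reverse=True)  # highest predicted preference first
    return [item_id for _, item_id in scored[:top_n]]


# Made-up factor vectors with k = 2 latent factors.
user = [1.0, 0.5]
items = {"a": [2.0, 0.0], "b": [0.0, 2.0], "c": [1.0, 1.0]}
top = recommend(user, items)
```

Scoring every item like this is the brute-force approach; it is the cost that the later scaling slides try to reduce.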

Collaborative Filtering: Loading Data in Spark ML

Alternating Least Squares: Implicit Preference Data
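For context, the formulation in "Collaborative Filtering for Implicit Feedback Datasets" (listed in the references) treats a raw interaction strength as a binary preference plus a confidence weight. The notation below follows that paper, not the slides: r_{ui} is the observed interaction strength for user u and item i, x_u and y_i are the user and item factor vectors, alpha scales the confidence and lambda is the regularization strength.

```latex
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0 \\
0 & \text{if } r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha \, r_{ui}

\min_{x_*, y_*} \; \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
+ \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)
```

Spark ML's ALS exposes this mode through its `implicitPrefs` and `alpha` parameters.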

Deploying & Scoring Recommendation Models

Prelude: Search. Full-text Search & Similarity
[Figure: query "cat videos" -> Analysis -> Term vectors -> Scoring (similarity) -> Ranking (sort results)]

Recommendation: Can we use the same machinery?
[Figure: user (or item) vector -> Analysis -> Term vectors -> Scoring (similarity) -> Ranking (sort results)]
- Dot product & cosine similarity: the same as we need for recommendations

Elasticsearch Term Vectors: Delimited Payload Filter
[Figure: raw vector -> custom analyzer -> term vector with payloads]
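Elasticsearch's delimited payload token filter splits each token at a delimiter character (`|` by default) and stores the part after the delimiter as that token's payload. A factor vector can therefore be indexed as a string of `index|value` tokens. The sketch below shows one way to produce and parse such a string in plain Python. The helper names and sample values are hypothetical, and the exact format used in the talk is an assumption.

```python
def to_payload_string(vector, delimiter="|"):
    """Encode a dense vector as space-separated 'index|value' tokens."""
    return " ".join(f"{i}{delimiter}{v}" for i, v in enumerate(vector))


def from_payload_string(text, delimiter="|"):
    """Decode the token string back into a dense list of floats."""
    pairs = (token.split(delimiter) for token in text.split())
    by_index = {int(i): float(v) for i, v in pairs}
    return [by_index[i] for i in range(len(by_index))]


# Made-up factor vector, used only to show the round trip.
encoded = to_payload_string([1.2, -0.2, 0.3])
decoded = from_payload_string(encoded)
```

Each vector index becomes a term and each factor value becomes that term's payload, which is what lets a scoring function read the vector back at query time.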

Elasticsearch Scoring: Custom scoring function
- Native script (Java), compiled for speed
- Scoring function computes the dot product:
  - For each document vector index ("term"), retrieve the payload
  - score += payload * query(i)
- Normalizes with the query vector norm and document vector norm for cosine similarity
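The scoring logic described above can be restated in a few lines of plain Python. This is a sketch of the arithmetic only, not the plugin's Java code: the document vector is modelled as a mapping from term (vector index) to payload, and the function name and `cosine` flag are hypothetical.

```python
import math


def score(doc_vector, query_vector, cosine=True):
    """Dot product of a document vector with the query, optionally normalized."""
    total = 0.0
    # doc_vector maps each term (vector index) to its payload (factor value).
    for index, payload in doc_vector.items():
        total += payload * query_vector[index]
    if not cosine:
        return total
    doc_norm = math.sqrt(sum(p * p for p in doc_vector.values()))
    query_norm = math.sqrt(sum(q * q for q in query_vector))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # cosine similarity is undefined for a zero vector
    return total / (doc_norm * query_norm)


# Made-up vectors: identical direction, so the cosine similarity is 1.
doc = {0: 3.0, 1: 4.0}
query = [3.0, 4.0]
dot = score(doc, query, cosine=False)
cos = score(doc, query)
```

Returning the raw dot product gives the predicted preference, while the normalized form gives cosine similarity, which suits item-to-item similarity.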

Recommendation: Can we use the same machinery?
[Figure: user (or item) vector -> Analysis (delimited payload filter) -> Term vectors -> Scoring (custom scoring function) -> Ranking (sort results)]

Elasticsearch Scoring: We get search engine functionality for free

Alternating Least Squares: Deploying to Elasticsearch

Monitoring & Feedback

System Events: Logging Recommendations Served

System Events: Logging Recommendation Actions

Tracking Performance
[Figure: Kafka -> Spark Streaming -> event store, dashboards]
- Performance monitoring & alerts
- Impression capping / fatigue

Scaling Model Scoring

Scoring Performance
[Chart: scoring time per query in ms, by factor dimension (k=20, k=50, k=100) and number of items (100,000 and 1,000,000). Setup: 3 nodes, 30 shards.]

Scoring Performance: Increasing number of shards
[Chart: scoring time per query in ms, by number of shards (10, 30, 60, 90) and number of items (100,000 and 1,000,000). Setup: 3 nodes, k=50.]

Scoring Performance: Locality Sensitive Hashing
- LSH hashes each input vector into L hash tables. Each table contains a hash signature created by applying k hash functions.
- The standard for cosine similarity is Sign Random Projections
- At indexing time, create a bucket by combining hash table id and hash signature
- Store buckets as part of item model metadata
- At scoring time, filter the candidate set using a term filter on the buckets of the query item
- Tune LSH parameters to trade off speed / accuracy
- LSH coming soon to Spark ML (SPARK-5992)
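The bucketing scheme described above can be sketched in plain Python. Sign random projections hash a vector by recording, for each of k random hyperplanes, which side of the hyperplane the vector falls on; doing this for L independent tables gives L bucket strings per vector. The function names, the `<table id>_<signature>` bucket format, the seed and the sample vectors below are all assumptions made for illustration.

```python
import random


def make_hyperplanes(num_tables, num_hashes, dim, seed=42):
    """Draw num_tables x num_hashes random Gaussian hyperplanes of size dim."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(num_hashes)]
            for _ in range(num_tables)]


def lsh_buckets(vector, hyperplanes):
    """Return one bucket string per table: '<table id>_<bit signature>'."""
    buckets = []
    for table_id, planes in enumerate(hyperplanes):
        # One bit per hyperplane: the sign of the projection onto it.
        bits = "".join(
            "1" if sum(p * v for p, v in zip(plane, vector)) >= 0 else "0"
            for plane in planes)
        buckets.append(f"{table_id}_{bits}")
    return buckets


planes = make_hyperplanes(num_tables=3, num_hashes=4, dim=3)
item = [1.2, -0.2, 0.3]             # made-up item vector
scaled = [2 * x for x in item]      # same direction, different length
item_buckets = lsh_buckets(item, planes)
query_buckets = lsh_buckets(scaled, planes)
# An item is a candidate if it shares at least one bucket with the query.
is_candidate = bool(set(item_buckets) & set(query_buckets))
```

Because only the sign of each projection is kept, vectors pointing in the same direction always land in the same buckets, which is why this family suits cosine similarity. Raising k makes buckets more selective, and raising L recovers recall at the cost of more stored buckets.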

Scoring Performance: Locality Sensitive Hashing
[Chart: scoring time per query in ms, brute force vs LSH. Setup: 3 nodes, 30 shards, k=50, 1,000,000 items.]

Scoring Performance: Comparison to score then search
[Chart: scoring time per query in ms, LSH vs score-then-search; bars for brute force, LSH and score-then-search; legend: Score, Sort, Search. Setup: 3 nodes, 30 shards, k=50, 1,000,000 items.]

Demo

Future Work

Future Work
- Apache Solr version of scoring plugin (any takers?)
- Investigate ways to improve Elasticsearch scoring performance
  - Performance for LSH-filtered scoring should be better
  - Can we dig deep into ES scoring internals to combine the efficiency of matrix-vector math with ES search & filter capabilities?
- Investigate more complex models
  - Factorization machines & other contextual recommender models
  - Scoring performance
- Spark Structured Streaming with Kafka, Elasticsearch & Kibana
  - Continuous recommender application including data, model training, analytics & monitoring

References
- Elasticsearch
- Elasticsearch Spark Integration
- Spark ML ALS for Collaborative Filtering
- Collaborative Filtering for Implicit Feedback Datasets
- Factorization Machines
- Elasticsearch Term Vectors & Payloads
- Delimited Payload Filter
- Vector Scoring Plugin
- Kafka & Spark Streaming
- Kibana

Thanks https://github.com/mlnick/elasticsearch-vector-scoring