Distributed Time Travel for Feature Generation. Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi

Size: px

Start display at page:

Download "Distributed Time Travel for Feature Generation. Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi"

Lee Fitzgerald
5 years ago
Views:

1 Distributed Time Travel for Feature Generation Prasanna Padmanabhan DB Tsai Mohammad H. Taghavi

2 Turn on Netflix, and the absolute best content for you would automatically start playing

3 Everything is a Recommendation Rows Ranking Over 80% of what members watch comes from our recommendations Recommendations are driven by Machine Learning Algorithms

4 Data Driven Try an idea offline using historical data to see if it would have made better recommendations If it did, deploy a live A/B test to see if it performs well in Production

5 Why build a Time Machine?

6 Quickly try ideas on historical data and transition to online A/B test

7 The Past Generate features based on event data logged in Hive Need to reimplement features for online A/B test Data discrepancies between offline and online sources Log features online where the model will be used Need to deploy each idea into production Feature generation calls online services and filters data past a certain time Works only when a service records a log of historical events Additional load on online services

8 DeLorean image by JMortonPhoto.com & OtoGodfrey.com

9 Time Travel using Snapshots Snapshot online services and use the snapshot data offline to generate features Share facts and features between experiments without calling live systems

10 How to build a Time Machine

11 Context Selection Data Snapshots APIs for Time Travel

12 Context Selection Runs once a day Hive Stratified Sampling Context Selection Contexts tagged with meta data S3 Context Set

Prana (Netflix Libraries) Thrift Data Snapshots

13 Data Snapshots S3 Viewing History Service Context Set Snapshot data for each Context MyList Service Prana (Netflix Libraries) Thrift Data Snapshots Runs once a day Ratings Service S3 Snapshot Parquet

14 APIs for Time Travel

15 Data Architecture Runs once a day Hive Stratified Sampling Context Selection Contexts tagged with meta data S3 Context Set Context Selection Viewing History Service MyList Service Prana (Netflix Libraries) Thrift Data Snapshots Data Snapshots Runs once a day Ratings Service Batch APIs S3 Snapshot Batch APIs RDD of Snapshot Objects

16 Generating Features via Time Travel

17 Great Scott! There s the DeLorean! DeLorean: A time-traveling vehicle uses data snapshots to travel in time scales with Apache Spark prototypes new ideas with Zeppelin requires minimal code changes from experimentation to A/B test to production

18 Running Time Travel Experiment Select the destination time Bring it up to 88 miles per hour!

Running Time Travel Experiment Offline Experiment Online System Design a New Experiment to Test Out Different Ideas Bad Metrics Design Experiment Model Testing Collect Label Dataset DeLorean: Offline

19 Running Time Travel Experiment Offline Experiment Online System Design a New Experiment to Test Out Different Ideas Bad Metrics Design Experiment Model Testing Collect Label Dataset DeLorean: Offline Feature Generation Distributed Model Training Parallel training of individual models using different executors Choose best model Compute Validation Metrics Good Metrics Online AB Testing Selected Contexts

20 DeLorean Input Data Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.) Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities). Labels: For supervised learning, this will be the label (target) for each item.

21 Feature Encoders Compute features for each item in a given context Each type of raw data element has its own data key Data map is a map from data keys to data objects in a given context Data map is consumed by feature encoder to compute features

22 Two type of Data Elements Context-dependent data elements Viewing History Mylist... Context-independent data elements Video Metadata Genre Metadata...

Context Independent Video Data Element Video Metadata Context: s Items: Country of Origin Matching Fraction Features Context: Items: s Context-Items =

23 Context Independent Video Data Element Video Metadata Context: s Items: Country of Origin Matching Fraction Features Context: Items: s Context-Items = 0.5 Context: = 0.5 Items: s = 0.5 Context Dependent Data Element Viewing History Context: Items: Context: Items: s s Context: = 1.0 Items: = 0.0 = 1.0 s

POJOs Data Keys Required Feature Keys Data Map Feature Encoders

24 Feature Generation S3 Snapshot Label Data Data Feature Model (JSON) Feature Encoders Feature Encoders Data Elements Data Keys Data in POJOs Data Keys Required Feature Keys Data Map Feature Encoders Feature Encoders Feature Encoders Features Label Features Model Training

25 Features Represented in Spark s DataFrames In nested structure to avoid data shuffling in ranking process Stored with Parquet format in S3

26 Features Context Item, label, and features

Going Online S3 Snapshot DeLorean: Offline

Feature Encoders Online Feature Generation

27 Going Online S3 Snapshot DeLorean: Offline Feature Generation Model Training / Validation / Testing Offline Experiment Viewing History Service MyList Service Ratings Service Shared Feature Encoders Online Feature Generation Deploy models Online Ranking / Scoring Service Online System

28 Conclusion Spark helped us significantly reduce the time from an idea to an AB Test

29 Future work Event Driven Data Snapshots Time Travel to the Future!!

30 We re hiring! (come talk to us) Tech Blog:

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following: