Real-time Recommendations on Spark. Jan Neumann, Sridhar Alla (Comcast Labs) DC Spark Interactive Meetup East May

1 Real-time Recommendations on Spark Jan Neumann, Sridhar Alla (Comcast Labs) DC Spark Interactive Meetup East May

2 Who am I?
Jan Neumann, Lead of Big Data and Content Analysis Research Teams
This is joint work with Sridhar Alla, Director/Big Data Architect, Enterprise BI
Comcast Labs DC Responsibilities
  Develop Content Discovery and Metadata Back-end Services for Comcast
  Innovation Voice Interface for TV
  Machine Learning/Data Science Expertise for all of Comcast

3 Comcast Labs DC powers all CONTENT DISCOVERY for X1: Search, Algorithmic Menus, Poster Art, Video On-Demand, Live TV, Personalized Recommendations

4 Who are we similar to? METADATA LIKE SEARCH LIKE RECOMMENDATIONS LIKE Powering millions of devices Taking into account your TV channels, subscriptions and tastes Including live programming

5 How it all works
Formed in 2011, we now operate one of the largest and most sophisticated metadata and content discovery platforms in the industry.
[Diagram labels: content information, metadata providers, discovery, content providers, content images, logos, menu, billing systems, customer usage, subscriber information, catalogs, entitlements, recommend, browse, personalize, voice control, search, millions of devices, channel lineups, deep metadata, purchases]

8 What are recommendation systems?
Recommendation systems (RS) are everywhere
  Video (Comcast, Netflix)
  Music (Apple Genius, Spotify, Pandora)
  Products (Amazon, eBay)
  Targeted Advertisements
RS help to match users with items
  Ease information overload (long tail)
  Sales assistance (guidance, advisory, persuasion, ...)
Different system designs / paradigms
  Based on availability of exploitable data
  Implicit and explicit user feedback
  Domain characteristics
[Figure labels: Search, Recommendations, Items]

9 Importance of recommending from the long tail

10 Content-based Recommendations
Main idea: Recommend items to customer x similar to previous items rated highly by x
Example: Movie recommendations
  Recommend movies with same actor(s), director, genre, ...
Websites, blogs, news
  Recommend other sites with similar content
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
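
The main idea on this slide can be sketched with plain Scala collections. This is a minimal illustration, not code from the talk: the `ContentBased` object, its function names, and the numeric feature vectors are assumptions made for the example, and cosine similarity is only one common choice of similarity measure.

```scala
object ContentBased {
  // Cosine similarity between two equal-length feature vectors.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "feature vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Score each unseen item by its summed similarity to the liked items,
  // then return the n best (ties broken alphabetically).
  def recommend(
      liked: Set[String],
      features: Map[String, Vector[Double]],
      n: Int): Seq[String] = {
    val likedFeatures = liked.toSeq.flatMap(item => features.get(item))
    (features.keySet -- liked).toSeq
      .map(item => (item, likedFeatures.map(f => cosine(f, features(item))).sum))
      .sortWith { case ((i1, s1), (i2, s2)) => s1 > s2 || (s1 == s2 && i1 < i2) }
      .take(n)
      .map(_._1)
  }
}
```

With hypothetical features such as (actor overlap, genre overlap) per movie, `ContentBased.recommend(Set("A"), features, 1)` returns the unseen item whose feature vector points in the direction closest to that of "A".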

11 Pros: Content-based Approach
+: No need for data on other users
  No cold-start or sparsity problems
+: Able to recommend to users with unique tastes
+: Able to recommend new & unpopular items
  No first-rater problem
+: Able to provide explanations
  Can provide explanations of recommended items by listing content features that caused an item to be recommended
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

12 Cons: Content-based Approach
-: Finding the appropriate features is hard
  E.g., images, movies, music
-: Recommendations for new users
  How to build a user profile?
-: Overspecialization
  Never recommends items outside user's content profile
  People might have multiple interests
  Unable to exploit quality judgments of other users
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

13 Collaborative Filtering: Example
Customer X
  Buys Metallica CD
  Buys Megadeth CD
Customer Y
  Does search on Metallica
  RS suggests Megadeth from data collected about customer X
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

14 Collaborative Filtering From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,

15 Pros/Cons of Collaborative Filtering
+ Works for any kind of item
  No feature selection needed
- Cold Start: Need enough users in the system to find a match
- Sparsity: The user/ratings matrix is sparse
  Hard to find users that have rated the same items
- First rater: Cannot recommend an item that has not been previously rated
  New items, esoteric items
- Popularity bias: Cannot recommend items to someone with unique taste
  Tends to recommend popular items
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

16 This Talk: Real-time TV Recommendations + = Trending For You

17 How can we do it?
Challenges
  We have millions of users, thousands of programs
  Programs on live TV are constantly changing (Cold-Start)
Approach inspired by "Google news personalization: scalable online collaborative filtering", Das et al., 2007
  Recommend what people in your geographic area with tastes similar to yours are currently watching, and do it in Spark.

18 Real-time Recommendations Algorithm
Cluster users by taste profiles and geographic proximity
Calculate Top K trending programs for each cluster
Look up cluster for user and return trending programs
[Diagram: User Request; User -> Cluster; Cluster -> Trending Programs; User -> Trending Programs]
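
The request path on this slide is a composition of two lookups. A minimal sketch, assuming both tables fit in in-memory maps (in the talk the trending lists are served from Redis); `TrendingLookup` and `trendingFor` are hypothetical names introduced for illustration.

```scala
object TrendingLookup {
  // Compose the two lookups: user -> cluster, then cluster -> trending programs.
  // Unknown users or clusters without a list yield an empty result.
  def trendingFor(
      user: String,
      userToCluster: Map[String, Int],
      clusterToTrending: Map[Int, Seq[String]]): Seq[String] =
    userToCluster
      .get(user)
      .flatMap(cluster => clusterToTrending.get(cluster))
      .getOrElse(Seq.empty)
}
```

Keeping the two tables separate means the batch job only has to refresh the user-to-cluster map, while the streaming job only rewrites the much smaller cluster-to-programs map.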

19 Real-time Recommendations in Spark
Thanks to Spark we can implement this quickly
[Diagram: User History from HDFS; Live Tune Activity via Kafka; Batch: User Clustering with MLlib; Real-time: TopK Trending Programs per Cluster w/ Spark Streaming; Real-time Program recommendations per user]

20 Collaborative Filtering Implementation: Matrix Factorization

21 Collaborative Filtering Implementation: Matrix Factorization

22 No Ratings: Implicit Matrix Factorization
ALS.trainImplicit(view_count, k, numIter, lambda, alpha)
For more info see Music Recommendations with Spark, Chris Johnson (Spotify), Spark Summit
View Count in % watched min
Confidence = f(view count)
[Diagram: Preference Matrix factored into User Matrix (#users*k, rows U_1 ... U_5) and Movie Matrix (k*#movies, columns M_1 ... M_3)]
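
The slide leaves the confidence function f unspecified. A common choice, from the implicit-feedback formulation of Hu, Koren and Volinsky that Spark's implicit ALS is based on, is a binary preference combined with a confidence that grows linearly with the observed count. The sketch below assumes that choice; the object and function names are illustrative and not taken from the talk.

```scala
object ImplicitFeedback {
  // Binary preference: 1 if the user watched the program at all, otherwise 0.
  def preference(viewCount: Double): Double =
    if (viewCount > 0.0) 1.0 else 0.0

  // Linear confidence c = 1 + alpha * r: every pair starts at confidence 1,
  // and more viewing raises the weight of that observation in the loss.
  def confidence(viewCount: Double, alpha: Double): Double =
    1.0 + alpha * viewCount
}
```

Here `alpha` plays the same role as the `alpha` argument passed to `ALS.trainImplicit`: it controls how quickly observed viewing outweighs the unobserved pairs.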

23 Math Detour: Cluster Normalized User Vectors
Spark KMeans can only cluster points in Euclidean space
Cannot cluster preference vectors directly:
\[ \|P_1 - P_2\|^2 = \|U_1 M - U_2 M\|^2 = (U_1 - U_2)\, M M^T (U_1 - U_2)^T \neq \|U_1 - U_2\|^2 \]
SVD of the movie matrix: \( M^T = X D Y^T \)

24 Batch: Cluster Users based on their Tastes
Group users by geographic area
Compute user taste vector from viewing history for each geographic area
Cluster users to find groups with similar tastes
[Pipeline: Usage history aggregation; Implicit Matrix Factorization; SVD for Dimensionality Reduction; KMeans Clustering of Users; User -> Cluster]

26 Batch Implementation
    import org.apache.spark.mllib.recommendation.ALS

    // for each geographic region
    // convert user viewing history to ratings (hash user_id to int)
    val user_history = sc.textFile("user_history.dat")
    val ratings = user_history.flatMap(parse_ratings)
    // build matrix factorization model
    val mf_model = ALS.trainImplicit(ratings, rank, n, lambda, alpha)

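28 Batch Implementation
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Matrices, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // transform the movie feature matrix
    val productRows = mf_model.productFeatures.map(s => Vectors.dense(s._2))
    val productRowMatrix = new RowMatrix(productRows)
    val productSVD = productRowMatrix.computeSVD(svdRank)
    // userRowMatrix is not defined on the slide; presumably a RowMatrix built from mf_model.userFeatures
    val userFeatures = userRowMatrix.multiply(productSVD.V).multiply(Matrices.diag(productSVD.s))
    // use latent taste space to cluster users
    val cluster_model = KMeans.train(userFeatures.rows, numClusters, numIter)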
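
Once the centroids are known, the User -> Cluster step reduces to a nearest-centroid search in the transformed taste space, which is what `KMeansModel.predict` does in MLlib. A minimal plain-Scala sketch of that assignment, with hypothetical names and with centroids passed in as ordinary vectors:

```scala
object NearestCluster {
  // Squared Euclidean distance between two equal-length vectors.
  def squaredDistance(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Index of the centroid closest to the user's taste vector
  // (the first centroid wins on exact ties).
  def assign(user: Vector[Double], centroids: Seq[Vector[Double]]): Int = {
    require(centroids.nonEmpty, "at least one centroid is required")
    centroids.zipWithIndex
      .map { case (c, i) => (squaredDistance(user, c), i) }
      .reduce((best, next) => if (next._1 < best._1) next else best)
      ._2
  }
}
```

The returned index is the cluster id that would be stored in the user-to-cluster table consulted at request time.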

29 Real-time Data Ingest
How data flows into the system
  Back-end servers log user interactions with the set-top box (STB)
  Data is processed and formatted using Flume
  Data is then passed on to
    Real-time processing using Storm
    Long-term storage in HDFS
    External consumption via access-controlled Kafka/REST
We connect to Kafka (real-time) or access data directly from HDFS (batch)

30 Real-time: Compute TopK TV programs per cluster
What is each viewer watching?
Aggregate popular programs across users for each cluster
Keep Top K (using Twitter Algebird TopK Monoid)
[Pipeline: Update Viewer State; Count Programs being viewed; Map Program Counts to User Clusters; Create TopK Program List for each Cluster; Cluster -> Trending Programs]

33 Real-Time Implementation
    // format: event_time device_id program_id station_id dma_title tune_type
    // get data from Kafka
    val tuneEventsPerUser = KafkaUtils
      .createStream(ssc, zkQuorum, groupId, topics, storageLevel)
      .flatMap(parseTuneEventByUser)
    // what is being watched by each user
    val userState = tuneEventsPerUser.updateStateByKey(updateUserHistory).cache()
    // aggregate tunes per program per cluster
    val tvTunes = userState.map {
      case (userId, tuneInfo) => ((tuneInfo.programId, user2cluster(userId)), 1)
    }.reduceByKey(_ + _)
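
The two stateful steps above (keep the latest program per user, then count viewers per program and cluster) can be followed without a Spark cluster. The sketch below mirrors them with ordinary Scala collections; `TuneEvent`, `updateState` and `countPerCluster` are hypothetical stand-ins, since the talk's `parseTuneEventByUser` and `updateUserHistory` helpers are not shown on the slides.

```scala
object TuneCounts {
  // Hypothetical event shape; the real record has more fields (see the format comment).
  final case class TuneEvent(userId: String, programId: Long)

  // Keep only the most recent program per user: later events in a batch win.
  def updateState(state: Map[String, Long], batch: Seq[TuneEvent]): Map[String, Long] =
    batch.foldLeft(state) { (s, e) => s.updated(e.userId, e.programId) }

  // Count current viewers per (program, cluster); users without a cluster are skipped.
  def countPerCluster(
      state: Map[String, Long],
      userToCluster: Map[String, Int]): Map[(Long, Int), Int] =
    state.toSeq
      .flatMap { case (user, program) => userToCluster.get(user).map(c => (program, c)) }
      .groupBy(identity)
      .map { case (key, hits) => (key, hits.size) }
}
```

Because the state holds one program per user, a viewer who changes channels is counted only for the program they are currently watching.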

36 Compute Top K Programs per Cluster
    import com.twitter.algebird.TopKMonoid

    case class ProgramCount(programId: Long, count: Int) extends Ordered[ProgramCount] {
      def compare(that: ProgramCount): Int = { ... }
    }
    val topKMonoid = new TopKMonoid[ProgramCount](topK)
    val tvTopK = tvTunes.map {
      case ((programId, clusterId), cnt) =>
        (clusterId, topKMonoid.build(ProgramCount(programId, cnt)))
    }.reduceByKey(topKMonoid.plus)
    // export top tunes to Redis for lookup by web service
    tvTopK.foreachRDD(rdd => rdd.foreachPartition(p => saveToRedis(p)))
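
What makes a top-K structure usable inside `reduceByKey` is that merging two partial results is associative. The sketch below is a plain-Scala stand-in for that idea, not the Algebird implementation: `TopKSketch`, its ordering, and the `plus` signature are assumptions made for illustration, and it expects each program to appear at most once across the lists being merged, as is the case after the per-program counts have been reduced.

```scala
object TopKSketch {
  final case class ProgramCount(programId: Long, count: Int)

  // Highest count first; ties broken by program id so the order is deterministic.
  private val byCountDesc: Ordering[ProgramCount] =
    Ordering.by((p: ProgramCount) => (-p.count, p.programId))

  // Merge two partial top-K lists and keep the K best entries.
  // Merging in any grouping gives the same result, so partitions can be
  // combined in whatever order the cluster happens to process them.
  def plus(k: Int)(a: List[ProgramCount], b: List[ProgramCount]): List[ProgramCount] =
    (a ++ b).sorted(byCountDesc).take(k)
}
```

Each cluster's list never grows beyond K entries, so the amount of data shuffled per cluster stays bounded regardless of how many programs are on air.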

37 Results
Leverage existing Hadoop infrastructure and data
Compute 10 user clusters for 100k users in less than 10 minutes using 100 cores, 128GB RAM
Consume STB events in real time directly from Kafka
Calculate Top K trending programs for each cluster in 20 sec micro-batches, storing the results in Redis
Service requests for Personalized Trending Shows = Happy Customer

38 Trending for You Web Service Morning

39 Trending for You Web Service Evening

40 Final Words
Thanks to Spark we implemented the first version in 2-3 weeks
This example accelerated adoption of Spark in dev & research
Many further improvements possible
  Do time-dependent clustering of user tastes
  Gather feedback from real users
We are hiring! Contact us at jobs.comcast.com

45 Math Detour: Cluster Normalized User Vectors
Spark KMeans can only cluster elements in Euclidean space
Problem:
\[ \|P_1 - P_2\|^2 = \|U_1 M - U_2 M\|^2 = (U_1 - U_2)\, M M^T (U_1 - U_2)^T \neq \|U_1 - U_2\|^2 \]
M.computeSVD:
\[ M^T = X D Y^T \quad \text{where } X^T X = I,\; Y^T Y = I,\; D \text{ diagonal} \]
Therefore:
\[ (U_1 - U_2)\, M M^T (U_1 - U_2)^T = (U_1 - U_2)\, Y D D Y^T (U_1 - U_2)^T = \|U'_1 - U'_2\|^2 \quad \text{with } U' = U Y D \]
KMeans.train(U', numClusters, numIter)

46 Motivation behind Matrix Factorizations (Latent Space Models)
[Figure: movies placed in a two-dimensional latent space, with ends of the axes marked "Geared towards females" / "Geared towards males" and "Serious" / "Funny". Movies shown: The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Ocean's 11, Lethal Weapon, The Lion King, The Princess Diaries, Independence Day, Dumb and Dumber]
From J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets

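47 The Effect of Regularization
[Figure: movies in a two-dimensional latent space with axes labelled Factor 1 and Factor 2, ends marked "Geared towards females" / "Geared towards males" and "serious" / "funny". Movies shown: The Color Purple, Amadeus, Braveheart, Sense and Sensibility, Ocean's 11, Lethal Weapon, The Princess Diaries, The Lion King, Independence Day, Dumb and Dumber]
\[ \min_{P,Q} \sum_{(x,i) \in \text{training}} (r_{xi} - q_i \cdot p_x)^2 + \lambda \Big[ \sum_x \|p_x\|^2 + \sum_i \|q_i\|^2 \Big] \]
min over factors of "error" + \lambda "length"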
