Antonio Fernández Anta



1 Antonio Fernández Anta Joint work with Luis F. Chiroque, Héctor Cordobés, Rafael A. García Leiva, Philippe Morere, Lorenzo Ornella, Fernando Pérez, and Agustín Santos


2 Recommendation Engines (RE) suggest items to users.
RE are becoming highly popular in many different contexts:
- Shopping websites (Amazon)
- Content distribution (Netflix, Spotify)
- Online social networks (Facebook, Twitter, LinkedIn)

3 Some authors talk about the age of recommendation versus the age of search (Chris Anderson, The Long Tail)

4 Most modern RE are based on collaborative filtering. Recommendations are based on:
- Historic data
- Similarity between users
- Similarity between items
Multiple metrics quantify similarity: Euclidean distance, cosine similarity, Pearson correlation, etc.
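The three similarity metrics named above can be sketched in plain Python. This is only an illustration: the function names and the treatment of zero vectors (returning 0.0) are assumptions, not something the slides specify.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation; 0.0 if either vector has zero variance (assumption)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)
```

Each function takes two equal-length numeric vectors, for example one row per user with one entry per item.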

5 The users and items are typically connected as a bipartite graph.
Considering users as nodes and items as hyperedges (or vice versa), we obtain hypergraphs, which can be transformed into graphs of users or of items.

6 Graph theory and network analysis concepts can be useful, as they have been before:
- Google, PageRank
- Natural Language Processing


7 We explore graph-based approaches for recommendation engines and apply them to an ecosystem of smartphone apps:
- Apps are advertised in other apps via banners
- The performance metric is click-through rate (CTR)
Large (big) data is available:
- More than a billion records
- Millions of users
- Hundreds of items (apps)

8 Recommendation engines based on:
- Collaborative filtering (in a wide sense)
- Graph theory concepts
The engines were evaluated in the real world (and showed good performance).

9 This involves processing the available historical data with big data technologies: Hadoop, Elastic MapReduce, Pig.
The processing is done in a few hours thanks to these technologies.


10 There are more than 100 GiB of historical data:
- Millions of users, hundreds of applications
- More than 1,400 million records
The data is in multiple tables.
The first process is to clean the data. This involves joining several tables.

11 The output of this process is a file with records that contain:
- User
- Application advertised
- Running application (publisher app)
- Action (advertisement, click)
- Date and time
In MySQL we started one of the join operations (not the largest) and stopped it after 30 hours, at no more than 15% completion.

12 We have used Hadoop on Amazon Elastic MapReduce and Pig scripts to process the data.
The output has more than 700 million records.

13 From the clean data, the graphs used by the RE have been generated.
This process is less time-consuming, since it typically involves data aggregation.

14 Common users graph: the graph that has apps as nodes and undirected links weighted by the number of common users
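As a rough sketch of how such a graph could be aggregated, the snippet below counts, for every pair of apps, the users they have in common. The input layout (`user_apps`, a mapping from user to the set of apps associated with that user) and all names are assumptions made for illustration; the slides do not say how the aggregation was implemented.

```python
from collections import defaultdict
from itertools import combinations


def common_users_graph(user_apps):
    """Return weights[a][b] = number of users that have both app a and app b."""
    weights = defaultdict(lambda: defaultdict(int))
    for apps in user_apps.values():
        # Every unordered pair of apps of this user shares one more user.
        for a, b in combinations(sorted(apps), 2):
            weights[a][b] += 1
            weights[b][a] += 1  # undirected link: keep both directions in sync
    return weights


# Hypothetical toy data, not taken from the real dataset.
USER_APPS = {
    "u1": {"chess", "mail"},
    "u2": {"chess", "mail", "maps"},
    "u3": {"mail", "maps"},
}
GRAPH = common_users_graph(USER_APPS)
```

With hundreds of apps the resulting graph is small, even when it is aggregated from millions of users.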

15 Shared users: the apps with the largest number of users shared with the publisher app are the preferred recommendations.
[Figure label: Publisher app]
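A minimal sketch of the shared users rule, assuming the common users graph is stored as a nested dictionary `graph[a][b]` of shared-user counts. The function name, the `k` parameter, and the alphabetical tie-break are illustrative assumptions.

```python
def shared_users_ranking(graph, publisher_app, k=3):
    """Rank the neighbours of the publisher app by number of shared users."""
    neighbours = graph.get(publisher_app, {})
    # Largest weight first; ties broken alphabetically (an assumption).
    ranked = sorted(neighbours, key=lambda app: (-neighbours[app], app))
    return ranked[:k]


# Hypothetical toy graph of shared-user counts.
TOY_GRAPH = {
    "chess": {"mail": 2, "maps": 1},
    "mail": {"chess": 2, "maps": 2},
    "maps": {"chess": 1, "mail": 2},
}
```

Calling `shared_users_ranking(TOY_GRAPH, "chess")` ranks `mail` ahead of `maps`, because `mail` shares more users with the publisher.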

16 Filtering algorithm:
- Let v be the binary vector of the requesting user's apps, and M the adjacency matrix of the common users graph
- The apps whose positions have the largest values in $v^T M$ are the preferred recommendations
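The product of the user vector and the adjacency matrix can be sketched as follows. The names and the toy matrix are hypothetical, and excluding apps the user already has from the ranking is an assumption that the slide does not state.

```python
def filtering_scores(apps, adjacency, user_apps):
    """Compute v^T M, where v[i] = 1 if the user has apps[i], else 0."""
    v = [1 if app in user_apps else 0 for app in apps]
    n = len(apps)
    # Entry j of v^T M is the sum over i of v[i] * M[i][j].
    return [sum(v[i] * adjacency[i][j] for i in range(n)) for j in range(n)]


def filtering_ranking(apps, adjacency, user_apps, k=3):
    """Rank apps by their score in v^T M, skipping apps the user already has (assumption)."""
    scores = filtering_scores(apps, adjacency, user_apps)
    candidates = [(s, a) for s, a in zip(scores, apps) if a not in user_apps]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [app for _, app in candidates[:k]]


# Hypothetical toy data: symmetric matrix of shared-user counts.
APPS = ["chess", "mail", "maps", "news"]
M = [
    [0, 2, 1, 0],
    [2, 0, 2, 3],
    [1, 2, 0, 1],
    [0, 3, 1, 0],
]
```

For a user with `chess` and `maps`, the score vector is the sum of the `chess` and `maps` rows of `M`, so an app scores highly when it shares many users with the apps the requester already runs.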

17 Common users graph, modulated by age (user weight decreases exponentially with age):
$\mathrm{Weight}(app_1, app_2) = \sum_{u} \delta^{\mathrm{age}(u)}$, where the sum is over the users $u$ shared by $app_1$ and $app_2$
- Aged shared users: same as shared users, in the aged graph
- Aged filtering: same as filtering, in the aged graph
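The aged weights could be aggregated along these lines. How age is measured, the value of the decay base, and the input layout are not given in the slides, so `user_age`, `delta=0.5`, and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def aged_common_users_graph(user_apps, user_age, delta=0.5):
    """Return weights[a][b] = sum over shared users u of delta ** age(u)."""
    weights = defaultdict(lambda: defaultdict(float))
    for user, apps in user_apps.items():
        contribution = delta ** user_age[user]  # older users count less when 0 < delta < 1
        for a, b in combinations(sorted(apps), 2):
            weights[a][b] += contribution
            weights[b][a] += contribution
    return weights


# Hypothetical toy data; age is in arbitrary units.
AGED_USER_APPS = {"u1": {"chess", "mail"}, "u2": {"chess", "mail", "maps"}}
AGED_USER_AGE = {"u1": 0, "u2": 2}
AGED_GRAPH = aged_common_users_graph(AGED_USER_APPS, AGED_USER_AGE, delta=0.5)
```

With `delta = 1` every user contributes 1 and the plain common users graph is recovered, so the shared users and filtering rules can run unchanged on either graph.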

18 The CTR graph is a directed graph with links weighted by the frequency of clicks in the banner of the application (head) in the publisher (tail)
[Figure: example CTR graph with edge labels 5/20, 4/12, 25/50, 6/10, 22/40, 7/14, 4/50]
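One way to aggregate such a graph from the clean records is sketched below. It assumes each record is reduced to a `(publisher, advertised, action)` tuple, with `action` equal to `"advertisement"` or `"click"` as in the record description, and that the edge weight is clicks divided by advertisements. The tuple layout and function name are illustrative.

```python
from collections import defaultdict


def ctr_graph(records):
    """Return {(publisher, advertised): clicks / advertisements} for every advertised pair."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for publisher, advertised, action in records:
        edge = (publisher, advertised)  # tail = publisher, head = advertised app
        if action == "advertisement":
            shown[edge] += 1
        elif action == "click":
            clicked[edge] += 1
    # Only edges with at least one advertisement get a weight.
    return {edge: clicked.get(edge, 0) / count for edge, count in shown.items()}


# Hypothetical toy records.
RECORDS = (
    [("chess", "mail", "advertisement")] * 4
    + [("chess", "mail", "click")]
    + [("mail", "maps", "advertisement")] * 2
)
CTR = ctr_graph(RECORDS)
```

Clicks with no matching advertisement record are ignored here, which is a simplification.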

19 Maxflow algorithm:
- Recommendation algorithm used to promote specific applications
- The apps with the largest maxflow in the CTR graph to the promoted apps are the preferred recommendations
[Figure labels: source, Promoted app]
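The slide does not say which max-flow algorithm was used or exactly how candidates were scored. The sketch below is one possible reading: CTR values act as edge capacities, Edmonds-Karp (BFS augmenting paths) computes the flow, and each candidate app is ranked by its max flow to a single promoted app. All names and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow; capacity maps (u, v) -> non-negative capacity."""
    if source == sink:
        raise ValueError("source and sink must differ")
    residual = defaultdict(float)
    neighbours = defaultdict(set)
    for (u, v), cap in capacity.items():
        residual[(u, v)] += cap
        neighbours[u].add(v)
        neighbours[v].add(u)  # reverse edges are needed in the residual graph
    flow = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v in neighbours[u]:
                if v not in parent and residual[(u, v)] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


def maxflow_ranking(ctr, promoted_app, k=3):
    """Rank apps by their max flow to the promoted app (one reading of the slide)."""
    apps = {app for edge in ctr for app in edge}
    scores = {a: max_flow(ctr, a, promoted_app) for a in apps if a != promoted_app}
    return sorted(scores, key=lambda a: (-scores[a], a))[:k]


# Hypothetical toy CTR graph: (publisher, advertised) -> click-through rate.
TOY_CTR = {
    ("chess", "mail"): 0.5,
    ("mail", "promo"): 0.25,
    ("chess", "promo"): 0.25,
    ("maps", "promo"): 0.125,
}
```

Unlike the shared users rule, this score depends on multi-hop paths through the whole graph, not just on the direct neighbours of one app.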

20 For reference, there are two basic recommendation engines:
- Random: engine that suggests random applications
- Static promotion: always returns the promoted apps


21 The different algorithms have been tested over a week in the real system.

Algorithm            CTR
Random               1.57%
Shared users         1.64%
Aged shared users    1.69%
Filtering            1.51%
Aged filtering       1.71%
Static promotion     1.45%
Maxflow              1.86%

- Aging is useful
- Global view
- These values improve over the current CTR
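To compare the engines on a common scale, the relative lift of each CTR over the Random reference engine can be computed as below. The CTR values are the ones reported above; choosing Random as the baseline and the names used are choices made for this illustration.

```python
# CTR values (in percent) as reported for the one-week test.
CTR_PERCENT = {
    "Random": 1.57,
    "Shared users": 1.64,
    "Aged shared users": 1.69,
    "Filtering": 1.51,
    "Aged filtering": 1.71,
    "Static promotion": 1.45,
    "Maxflow": 1.86,
}


def relative_lift(ctr, baseline="Random"):
    """Return each engine's CTR change relative to the baseline engine (0.10 = +10%)."""
    base = ctr[baseline]
    return {name: value / base - 1.0 for name, value in ctr.items()}


LIFT = relative_lift(CTR_PERCENT)
BEST = max(CTR_PERCENT, key=CTR_PERCENT.get)

for name, lift in sorted(LIFT.items(), key=lambda item: -item[1]):
    print(f"{name:<18} {lift:+.1%}")
```

Both aged variants come out above their non-aged counterparts, which matches the remark that aging is useful.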


22 Graph analytics can be useful in the development of recommendation engines.
Big data technology allowed us to process historical data and produce graphs.
The graphs generated are small. They could be processed with classical technologies.
Current MapReduce technologies do not seem to be the solution for large graph analysis.

23 Explore technologies that are more suited for large graph analytics:
- GraphLab, GraphChi
- Spark, GraphX
- Stratosphere, Flink, Spargel
Devise ways to process incremental data.
Design and testing of new recommendation algorithms that use larger graphs.

24 Thank you!
