Parallel machine learning using Menthor

Bruno Studer
Ecole polytechnique federale de Lausanne

June 8, 2012

1 Introduction

Collaborative filtering algorithms are widely used by websites that recommend items, be it movies, commercial products, or even dates. These websites use their registered users and the ratings they give, or the habits they have, to form their dataset. Of course, the larger the dataset, the better the results this category of algorithms yields. Since we want the largest possible dataset, we also need an efficient way to compute the recommendations; and if we cannot find a more efficient algorithm, we can parallelize the one we have. The remaining question is: how well can these algorithms be parallelized? To answer this question we will investigate the Scala parallel collections, which are pretty straightforward to use, and we will also try to transform our problem into a graph problem to fit the Menthor framework.

2 Framework

Menthor [1] is a framework developed in Scala by Heather Miller and Philipp Haller at EPFL. The principle of Menthor is to represent the dataset as a graph and to work locally on the vertices. Communication then happens over the edges using messages; finally, a crunch step aggregates the data of the whole graph. An algorithm must be split into synchronized steps of two kinds:

- The update step, where each vertex locally updates itself using the incoming messages from the adjacent vertices.
- The crunch step, which operates like one huge reduce and, once all the data is processed, makes the result available to all the vertices for the next step.

More information on Menthor can be found in the authors' publication [2].

[1] Project page: http://lcavwww.epfl.ch/~hmiller/menthor/
[2] Publication: http://lcavwww.epfl.ch/~hmiller/docs/scalawksp11.pdf
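To give an intuition of this execution model, here is a toy sketch of the update/crunch cycle in plain Scala. It is deliberately not the Menthor API (see the publication above for that); every name in it is illustrative.

```scala
// Toy model of the two-phase cycle described above; NOT the Menthor API.
object ToyMenthor {
  case class Message[A](from: Int, to: Int, payload: A)

  class ToyVertex[A](val id: Int, var value: A, val neighbors: List[Int]) {
    // Update step: consume the incoming messages, update the local value,
    // then emit messages over the edges for the next synchronized step.
    def update(incoming: List[Message[A]],
               f: (A, List[Message[A]]) => A): List[Message[A]] = {
      value = f(value, incoming)
      neighbors.map(n => Message(id, n, value))
    }
  }

  // Crunch step: one big reduce over all vertex values; the framework then
  // makes the result available to every vertex at the next step.
  def crunch[A](vertices: Seq[ToyVertex[A]], combine: (A, A) => A): A =
    vertices.map(_.value).reduce(combine)
}
```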

3 Algorithm

The algorithm chosen is a collaborative filtering algorithm called Alternating Least Squares with Weighted-λ-Regularization (ALS-WR). The important idea behind this algorithm is to describe users and items as vectors of size $n_f$; the bigger the size, the more precise the algorithm gets. Each vector represents a fixed number of criteria which should (ideally) fully describe a user or an item. And (if the model were perfect) we would only have to compute the scalar product between a user and an item to find the rating this item would get from this user.

The optimization method is a least-squares approach; the mathematical development is done in detail in the article Large-scale Parallel Collaborative Filtering for the Netflix Prize [3]. Nevertheless, we quickly review the core of the development. The vectors representing a user ($u_i$) or a movie ($m_j$) are put together to form the matrices $U$ and $M$:

$$u_i \in \mathbb{R}^{n_f}, \quad m_j \in \mathbb{R}^{n_f}, \quad U = [u_i], \quad M = [m_j]$$

We can then construct our least-squares rule, where $r_{ij}$ is a known rating and $\langle u_i, m_j \rangle$ is the scalar product between a user and a movie:

$$L^{emp}(R, U, M) = \frac{1}{n} \sum_{(i,j) \in I} \left( r_{ij} - \langle u_i, m_j \rangle \right)^2$$

This is the mean squared error, whose square root is the root mean square error (RMSE). Now we can add the λ constraint:

$$L_{\lambda}^{reg}(R, U, M) = L^{emp}(R, U, M) + \lambda \left( \lVert \Gamma_U U \rVert^2 + \lVert \Gamma_M M \rVert^2 \right)$$

where $\Gamma_U$ and $\Gamma_M$ are chosen Tikhonov matrices. In our case, we take one of the simplest matrices possible: the identity matrix weighted by the number of movies a user has rated, respectively the number of users who have rated a movie:

$$\Gamma_U = \mathrm{diag}(n_{u_i}) \qquad \Gamma_M = \mathrm{diag}(n_{m_j})$$

[3] Yunhong Zhou, Dennis Wilkinson, Robert Schreiber, and Rong Pan. Large-scale Parallel Collaborative Filtering for the Netflix Prize. HP Labs, 1501 Page Mill Rd, Palo Alto, CA 94304.
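To make the objective concrete, here is a minimal sketch of the empirical loss in plain Scala, assuming users and movies are stored as maps of $n_f$-dimensional arrays; the names are illustrative, not taken from the project's code.

```scala
// Sketch of L^emp: the mean squared error over the known ratings I.
object EmpiricalLoss {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    for (k <- a.indices) s += a(k) * b(k)
    s
  }

  // ratings holds the known (userId, movieId, r_ij) triples.
  def lossEmp(u: Map[Int, Array[Double]],
              m: Map[Int, Array[Double]],
              ratings: Seq[(Int, Int, Double)]): Double = {
    val squaredErrors = ratings.map { case (i, j, r) =>
      val e = r - dot(u(i), m(j))    // r_ij - <u_i, m_j>
      e * e
    }
    squaredErrors.sum / ratings.size // (1/n) * sum over (i,j) in I
  }
}
```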

To apply this least-squares optimization we use the following update steps for $U$ and $M$, applied one after the other.

Updating U:

$$A_i = M_{I_i} M_{I_i}^T + \lambda n_{u_i} E \qquad (1)$$

$$V_i = M_{I_i} R^T(i, I_i) \qquad (2)$$

$$u_i = A_i^{-1} V_i \qquad (3)$$

Updating M:

$$A_j = U_{I_j} U_{I_j}^T + \lambda n_{m_j} E \qquad (4)$$

$$V_j = U_{I_j} R^T(I_j, j) \qquad (5)$$

$$m_j = A_j^{-1} V_j \qquad (6)$$

where:

- $I_i$ is the set of movies that user $i$ has rated
- $n_{u_i}$ is the cardinality of $I_i$
- $E$ is the $n_f \times n_f$ identity matrix
- $M_{I_i}$ is the sub-matrix of $M$ where the columns $j \in I_i$ are selected
- $R(i, I_i)$ is the row vector obtained by taking the columns $j \in I_i$ of the $i$-th row of $R$

3.1 Parallel processing

The important thing to understand for parallelizing this algorithm is that there are three steps:

1. fixing the items, and solving for the users
2. fixing the users, and solving for the items
3. calculating the error

These steps depend on one another, but the first two each update a huge matrix through independent operations on every line. So the parallelization will apply mainly to the linear algebra computations (matrix multiplication, matrix inversion, solving systems of equations); the sketch below shows one such per-user solve.
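As a concrete illustration of equations (1)-(3), here is a minimal per-user update in plain Scala. The data layout (maps of arrays) and all names are illustrative assumptions, and a naive Gaussian solver stands in for the Cholesky-based inversion discussed in section 4.2; it is a sketch, not the project's implementation.

```scala
// Sketch of one user update: u_i = (M_Ii M_Ii^T + lambda*n_ui*E)^-1 V_i.
object AlsUserUpdate {
  // Solve A x = b by Gaussian elimination with partial pivoting
  // (the project uses a Cholesky decomposition instead, A being
  // symmetric positive definite; this just keeps the sketch short).
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    for (k <- 0 until n) {
      val p = (k until n).maxBy(r => math.abs(a(r)(k)))
      val ra = a(k); a(k) = a(p); a(p) = ra
      val rb = b(k); b(k) = b(p); b(p) = rb
      for (r <- k + 1 until n) {
        val f = a(r)(k) / a(k)(k)
        for (c <- k until n) a(r)(c) -= f * a(k)(c)
        b(r) -= f * b(k)
      }
    }
    val x = new Array[Double](n)
    for (k <- n - 1 to 0 by -1)
      x(k) = (b(k) - (k + 1 until n).map(c => a(k)(c) * x(c)).sum) / a(k)(k)
    x
  }

  def updateUser(ratings: Map[Int, Double],   // I_i: movieId -> r_ij
                 m: Map[Int, Array[Double]],  // movie feature vectors
                 lambda: Double, nf: Int): Array[Double] = {
    val a = Array.fill(nf, nf)(0.0)
    val v = new Array[Double](nf)
    for ((j, r) <- ratings; mj = m(j); p <- 0 until nf) {
      v(p) += mj(p) * r                              // V_i, eq. (2)
      for (q <- 0 until nf) a(p)(q) += mj(p) * mj(q) // M_Ii M_Ii^T
      // (the project fills only half of A, exploiting symmetry)
    }
    for (p <- 0 until nf) a(p)(p) += lambda * ratings.size // + lambda*n_ui*E
    solve(a, v)                                            // eq. (3)
  }
}
```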

4 Implementation

4.1 Dataset

The first problem that arises is how to keep the dataset in memory. At first our data can be viewed as a huge list of Tuple3 containing:

1. a unique user id
2. a unique movie id
3. the rating from this user for this movie

Of course, as we want to go as fast as possible, we need a much more suitable data structure for our problem. Going back to equations (1) and (4), we see that throughout the processing we will need to access two kinds of sets for every user or movie:

- the set of all movies rated by one user;
- the set of all users who have rated a specific movie.

The best construct to access these two sets as fast as possible is to create two objects:

- the user map: Map[userId, Map[movieId, rating]]
- the movie map: Map[movieId, Map[userId, rating]]

This is of course not optimal memory-wise, since we duplicate all the information; but this duplication gives us almost O(1) access to either of the sets above.

4.2 Matrix

4.2.1 Multiplication

As seen in equations (3) and (6), there is intensive use of linear algebra operations, so we need our Matrix class to be as fast as possible. The main time-consuming operation comes from equations (1) and (4), where we have a product of two potentially huge matrices. Unfortunately there is not much we can do to avoid $O(n^3)$: Strassen's algorithm ($O(n^{2.807})$) requires square matrices, and in our case that would mean applying padding, which could become very inefficient as the sizes of our matrices vary a lot. However, we can use the fact that we are multiplying a matrix with its own transpose, which always results in a square symmetric (Hermitian) matrix. This gives us the possibility to calculate only half of the coefficients, leaving about $n^3/2$ operations.

4.2.2 Inversion

Another important step of the algorithm appears in equations (3) and (6), where we have a matrix inversion. Our first choice for this was the LU decomposition, which takes $O(2n^3/3)$ operations. But as we saw above, the two matrices from equations (1) and (4) are symmetric, so we can use the Cholesky decomposition, which is $O(n^3/3)$.

4.3 Parallel Collections

The implementation using the parallel collections is very simple, since we have update steps in which all the vectors are updated independently. We only need to transform our Map into its parallel equivalent, which is done by prefixing the collection type with the three letters Par, and the trick is done. The update operation on the user or movie matrix, which was written as a single foreach loop, then becomes automatically parallel, as the sketch below illustrates.
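A minimal sketch (Scala 2.9/2.10-era parallel collections; names are illustrative and reuse the hypothetical AlsUserUpdate sketch from above):

```scala
// The per-user updates are independent, so parallelizing the loop over
// users amounts to switching the collection to its Par... equivalent
// (or, equivalently, calling .par on the sequential map).
def updateAllUsers(userMap: Map[Int, Map[Int, Double]], // user map from 4.1
                   movies: Map[Int, Array[Double]],
                   lambda: Double, nf: Int): Map[Int, Array[Double]] =
  userMap.par.map { case (uid, ratings) =>
    uid -> AlsUserUpdate.updateUser(ratings, movies, lambda, nf)
  }.seq  // back to a sequential Map once all updates are done
```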

4.4 Menthor

Since Menthor is built around machine learning on graphs, we must adapt our algorithm to a graph. We start by collecting all the users into one set of vertices and all the movies into another one. Then we simply connect a user-movie pair of vertices if and only if the user has rated the movie. With this construction we can easily see that the graph is indeed bipartite. Furthermore, in this graph representation we have only a few edges between the two sets (the given ratings). The goal is now to construct the fully connected bipartite graph, with all the calculated ratings, that minimizes the error.

Now that we have the vertices, we must set the Data type our graph will use. In our case we need to be able to reduce over the whole graph during the crunch step, which implements the RMSE computation. To achieve this we choose the Data type to be a Tuple2 of the vector representation of a user/movie and a Float value holding the local squared error. With this Data type the crunch step becomes very easy, since we just have to sum all the local errors.

Finally, we must construct our update step without inversion of control. We do this using the then combinator, building five substeps:

1. the movie vertices send their (updated) values to the user vertices
2. the user vertices read their messages, update their values, and send them back to the movies
3. the movie vertices update their values using the ones sent by the user vertices
4. the RMSE is computed using the crunch step
5. the monitor vertex receives the result of the aggregation and uses it (in our case, it is just printed)

These five substeps form the single Menthor step; the sketch after this list shows the same dataflow written sequentially.
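Since Menthor's combinators are best read in the framework's own documentation, here is instead the dataflow of one such step as a plain sequential sketch, reusing the hypothetical AlsUserUpdate from above (the movie update has exactly the same shape as the user update, with the roles swapped):

```scala
// One Menthor step, substeps 1-5, as a sequential sketch (illustrative).
def alsStep(userMap: Map[Int, Map[Int, Double]],  // userId -> (movieId -> r)
            movieMap: Map[Int, Map[Int, Double]], // movieId -> (userId -> r)
            m: Map[Int, Array[Double]],           // current movie vectors
            lambda: Double, nf: Int): Unit = {
  def dot(a: Array[Double], b: Array[Double]) =
    a.indices.map(k => a(k) * b(k)).sum

  // substeps 1-2: movies send their vectors, users update (eq. 1-3)
  val u2 = userMap.map { case (i, rs) =>
    i -> AlsUserUpdate.updateUser(rs, m, lambda, nf) }
  // substep 3: users send back, movies update (eq. 4-6, same shape)
  val m2 = movieMap.map { case (j, rs) =>
    j -> AlsUserUpdate.updateUser(rs, u2, lambda, nf) }
  // substep 4: crunch = sum the local squared errors over the graph
  val se = for ((i, rs) <- userMap.toSeq; (j, r) <- rs)
           yield { val e = r - dot(u2(i), m2(j)); e * e }
  // substep 5: the monitor vertex consumes the aggregate (here: print)
  println("RMSE = " + math.sqrt(se.sum / se.size))
}
```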

4.5 User's perspective

With the implementation completed, it is good to look back and compare the practicality of the two parallelization approaches we used.

First, the parallel collections: they are very useful and powerful once you have isolated an independent set of operations to compute. In this project the independent set was not very complicated to dig out, so the parallel collections were simple to use. Technically, you only need to add Par to your collection to make it parallel, which is very quick and easy to do. On the other hand, if such an extraction is hard to see, or even impossible, it quickly becomes very hard to use the parallel collections well.

Second, Menthor: as a framework it is not as quick or simple to use for the first time, but once your problem is transformed into a graph problem, Menthor's tools become very practical. You can easily create vertices, connect them, and add them to the graph; and when it is time to form the update steps, the then and crunch combinators become very handy. In this respect Menthor takes more time to pick up, but its construction forces more clarity into the step/substep structure.

5 Benchmark

5.1 Protocol

The benchmarking is done using the datasets from the GroupLens research lab [4]. We use the two sets 100k and 1M:

- 100k: 100,000 ratings from 1,000 users on 1,700 movies.
- 1M: 1 million ratings from 6,000 users on 4,000 movies.

With these sets we test the three different versions of our algorithm:

1. the serial control version
2. the parallel version using the parallel collections
3. the parallel version using Menthor

For each of these, we do three-pass runs with the following numbers of hidden factors: n_f = 50, 100, 200, 300, 500. We then continue with only the two parallel versions, with n_f = 700, 1000. For each run we collect the summed running time of the three steps, each step containing three substeps:

1. updating the user matrix
2. updating the movie matrix
3. computing the error

[4] http://www.grouplens.org/node/73

5.2 Results

The benchmarks were run on an i7-920 (4 physical cores with hyperthreading). We also tried another machine with 8 physical cores and no hyperthreading, but the results were about the same.

The first set of benchmarks was made on the small database (100k ratings), to get a general idea of how well the parallel collections and Menthor would handle the algorithm, and to get an approximation of the speedup we could hope for. A sketch of the kind of timing harness used for such measurements follows below.
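For completeness, a minimal sketch of such a timing harness in plain Scala; the step functions named in the commented usage are hypothetical, and this is not the project's actual benchmark code:

```scala
// Time a block and report seconds; averaging over passes happens outside.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime
  val result = block
  println(label + ": " + (System.nanoTime - start) / 1e9 + " s")
  result
}

// Hypothetical usage over the three substeps of one step:
// timed("step") {
//   timed("update U") { /* updateAllUsers(...) */ }
//   timed("update M") { /* updateAllMovies(...) */ }
//   timed("RMSE")     { /* computeError(...)    */ }
// }
```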

As expected, the parallelization becomes useful only once we reach a large enough n_f factor. It is also interesting to see that with the small database the difference between the parallel collections and Menthor is almost invisible. The speedup for this first part is shown in the next graph:

[Graph: speedup vs. n_f on the 100k dataset, n_f = 50 to 500]

We see that the parallel collections are consistent in their speedup. Menthor, by its construction, uses many more objects and high-level constructs, which, for very small values, diminishes the speedup. But even at average values (n_f ≈ 300) Menthor catches up with the parallel collections. We now add the remaining runs with the higher hidden factors:

[Graph: running time on the 100k dataset, including n_f = 700 and 1000]

This time we start to see a gap between Menthor and the parallel collections at the highest factors; the difference is about 7% on average. So already on the smaller dataset some tendencies emerge. On the parallelization side, it still works pretty well with this algorithm, and so far the Menthor framework is not far behind in terms of performance. To investigate this a little further, we now look at the next graph, which comes from the bigger dataset (1M ratings):

[Graph: running time vs. n_f on the 1M dataset]

As we would expect, this is very similar to the smaller dataset, but this time the difference between Menthor and the parallel collections is a bit more visible. Let us look at the speedup this time:

[Graph: speedup vs. n_f on the 1M dataset]

The same remarks about Menthor's increasing speedup apply, but this time it is more consistent. And finally we expand to the full graph of the 1M dataset:

[Graph: running time over the full range of n_f on the 1M dataset]

This last graph confirms two things:

- the parallel collections as well as Menthor appear consistently scalable, as the running times of these benchmarks ranged from 8 seconds to 7 hours;
- with this simple algorithm, which uses mainly basic operations (linear algebra) with very little abstraction, the parallel collections keep about a 10% advantage in running time.

6 Conclusions

The algorithm we chose was a really good test case because it was ideal for the parallel collections: all the updates can be done independently from one another, and the cost of each update is roughly uniform. As we would have expected, the parallel collections performed and scaled very consistently over a wide range of parameters. Furthermore, they were easy to use and did not need specific tuning.

The Menthor framework, even though this was not the best context for it, still managed to be almost as efficient as the parallel collections, and it scales very well too. However, one of the difficulties was to understand the algorithm as a graph problem; but this difficulty is more of a theoretical one. On the implementation side, once you find a good representation, the Menthor construction feels natural, with its easy vertex creation, vertex connection and, of course, the update steps that can be split into substeps. It feels more like dataflow programming with synchronous jumps to the next step.

Finally, to probe Menthor's capabilities further, we could try more complex algorithms or more intensive parallelization (a larger number of cores, distributed systems, or GPU parallelization).