Conflict Graphs for Parallel Stochastic Gradient Descent


Darshan Thaker*, Guneet Singh Dhillon*

Abstract— We present various methods for inducing a conflict graph in order to effectively parallelize Pegasos. Pegasos is a stochastic sub-gradient descent algorithm for solving the Support Vector Machine (SVM) optimization problem [3]. In particular, we introduce a binary tree-based conflict graph that matches the convergence of a well-known parallel implementation of stochastic gradient descent, known as HOGWILD!, while speeding up training time. We measure these results on a real-world dataset in a binary classification setting, which allows us to run various experiments comparing Parallel Pegasos against traditional Pegasos over iterations in order to measure classification accuracy.

* These authors contributed equally to this report.

I. INTRODUCTION

A Support Vector Machine (SVM) is a popular classification algorithm. It learns a hyperplane that linearly separates a set of points with as wide a margin as possible. In a binary classification task, one halfspace created by the hyperplane corresponds to points with one label, and the other halfspace corresponds to points with the other label. A hyperplane is represented by its normal vector, denoted $w$, and the classification of a data point $x$ is determined by its inner product with $w$: if the inner product is positive, $x$ is labeled $+1$, and $-1$ otherwise.

There may be several hyperplanes that linearly separate the training data. However, for the SVM to generalize well, the algorithm picks the hyperplane whose margin is as wide as possible. The width of the margin is determined by the minimum distance from the hyperplane to the data points; in other words, the boundary separating the two classes is chosen so that the minimum distance to a data point is the same for both classes. Consequently, when predicting the class of a data point, we want the absolute value of its inner product with $w$ to be above a certain threshold. This leads to the hinge loss for a given point $x$ with label $y$ (ignoring the bias term):

$$\ell(w; (x, y)) = \max\{0,\; 1 - y \langle w, x \rangle\}$$

Here, $\langle u, v \rangle$ denotes the inner product between $u$ and $v$. The width of the margin is proportional to $1 / \|w\|_2$, so maximizing the width amounts to minimizing the norm of $w$. This can be incorporated as a regularization term, in addition to the hinge loss of each data point. Formally, given a training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, the optimization problem for an SVM is:

$$\min_{w} \; \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{m} \sum_{(x, y) \in X} \ell(w; (x, y))$$

where $\lambda$ is the regularization parameter and $\ell(w; (x, y))$ is the hinge-loss function. The Pegasos algorithm solves this primal objective using a vanilla stochastic sub-gradient descent method.

II. PEGASOS

In stochastic sub-gradient descent, the objective function considered at each iteration is that of a single data point picked for that iteration. Initializing $w_1 = 0$, at iteration $t$ a random data point $(x_{i_t}, y_{i_t})$ is picked, with the index $i_t$ drawn from $\{1, \ldots, m\}$. The objective function considered for this data point is:

$$\frac{\lambda}{2} \|w_t\|_2^2 + \ell(w_t; (x_{i_t}, y_{i_t}))$$

Calculating the sub-gradient of the above objective function, we obtain:

$$\nabla_t = \lambda w_t - \mathbb{1}[y_{i_t} \langle w_t, x_{i_t} \rangle < 1] \, y_{i_t} x_{i_t}$$

where $\mathbb{1}[y_{i_t} \langle w_t, x_{i_t} \rangle < 1]$ takes the value 1 if the condition is met, and 0 otherwise. The update for $w$ is:

$$w_{t+1} = w_t - \eta_t \nabla_t$$

At the end of $T$ iterations, $w_{T+1}$ is the required normal to the hyperplane that defines the SVM. The pseudo-code for the algorithm is given in Algorithm 1.
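For concreteness, a minimal Python sketch of this update loop is shown below. It is illustrative only: variable names such as `lam` and `eta` are placeholders, and it is not the exact implementation used in the experiments. The formal pseudo-code follows in Algorithm 1.

```python
import numpy as np

def pegasos(X, y, lam=1e-4, T=1000, rng=None):
    """Minimal Pegasos sketch: stochastic sub-gradient descent on the
    primal SVM objective. X is an (m, d) array, y holds +/-1 labels."""
    rng = np.random.default_rng() if rng is None else rng
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)            # pick a data point uniformly at random
        eta = 1.0 / (lam * t)          # step size eta_t = 1 / (lambda * t)
        if y[i] * X[i].dot(w) < 1:     # hinge loss active: sub-gradient includes -y_i x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                          # only the regularization term contributes
            w = (1 - eta * lam) * w
    return w
```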

Algorithm 1 Pegasos
Given: $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $\lambda$, $T$
Initialize: Set $w_1 = 0$
for $t = 1, 2, \ldots, T$ do
    Choose $i_t \in \{1, \ldots, m\}$ uniformly at random
    Set $\eta_t = \frac{1}{\lambda t}$
    if $y_{i_t} \langle w_t, x_{i_t} \rangle < 1$ then
        Set $w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t + \eta_t y_{i_t} x_{i_t}$
    else
        Set $w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t$
    end if
end for
return $w_{T+1}$

A. Mini-Batch

Instead of considering just one data point at every iteration, $k$ data points can be considered in order to obtain a better estimate of the actual gradient of the objective function. At iteration $t$, a set $A_t$ of $k$ indices is chosen at random. The objective function considered for the data points with these indices becomes:

$$\frac{\lambda}{2} \|w_t\|_2^2 + \frac{1}{k} \sum_{i \in A_t} \ell(w_t; (x_i, y_i))$$

Calculating the sub-gradient of the above objective function, we obtain:

$$\nabla_t = \lambda w_t - \frac{1}{k} \sum_{i \in A_t} \mathbb{1}[y_i \langle w_t, x_i \rangle < 1] \, y_i x_i$$

The update for $w$ is:

$$w_{t+1} = w_t - \eta_t \nabla_t$$

At the end of $T$ iterations, $w_{T+1}$ is the required normal to the hyperplane that defines the SVM. The pseudo-code for the algorithm is given in Algorithm 2.

Algorithm 2 Mini-Batch Pegasos
Given: $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $\lambda$, $T$, $k$
Initialize: Set $w_1 = 0$
for $t = 1, 2, \ldots, T$ do
    Choose $A_t \subseteq \{1, \ldots, m\}$ uniformly at random, where $|A_t| = k$
    Set $A_t^+ = \{i \in A_t : y_i \langle w_t, x_i \rangle < 1\}$
    Set $\eta_t = \frac{1}{\lambda t}$
    Set $w_{t+1} \leftarrow (1 - \eta_t \lambda) w_t + \frac{\eta_t}{k} \sum_{i \in A_t^+} y_i x_i$
end for
return $w_{T+1}$

III. PARALLELIZING STOCHASTIC GRADIENT DESCENT

Various approaches have been proposed to parallelize stochastic gradient descent (SGD) methods. The problem is inherently serial, since the parameter vector $w$ at one iteration depends directly on the parameter vector of the previous iteration. Looking at how Mini-Batch Pegasos works, it is easy to parallelize the algorithm by creating $k$ threads in each iteration, where each thread receives one data point and computes its contribution to the update of the parameter vector. Once all threads finish, their contributions are summed to obtain the update for that iteration. This, however, does not yield a very large speed-up.

One notable approach for parallelizing SGD is HOGWILD! [2], a lock-free approach. In this algorithm, each thread operates on a shared-memory parameter vector without using any locks. This means that reads and writes can happen concurrently, and one thread may be modifying the parameter vector while another thread reads an inconsistent version for its update. However, the success of this method relies on a measure of sparsity and randomness that ensures these inconsistent reads and writes do not drastically affect convergence. In fact, when the input data is extremely sparse, this method performs very well and converges quickly.

A more recent approach for parallelizing SGD is the CYCLADES algorithm [1]. This algorithm focuses on picking the indices of the data points used for the gradient updates. The key observation is that inconsistent reads and writes occur when two data points have overlapping supports, or equivalently, when the two data points conflict. In this scenario sparsity is helpful, since it is easier to find two data points that do not have overlapping supports. Suppose we wish to pick $k$ indices of data points to use for updating the gradients. The contribution of CYCLADES is to randomly sample $k$ indices and model the conflicts between these $k$ indices as a conflict graph, connecting two indices whenever there is a conflict between them. Then, each connected component of the conflict graph represents a set of indices that can be allocated to a single core and updated safely in parallel with the other components.
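To make the notion of a conflict concrete, the sketch below checks whether two sparse data points have overlapping supports, which is exactly the condition under which their gradient updates touch the same coordinates of $w$. It is illustrative only; CYCLADES itself uses a more elaborate sampling and allocation scheme than this helper suggests.

```python
import numpy as np
from scipy.sparse import csr_matrix

def conflicts(X_sparse, i, j):
    """Return True if data points i and j have overlapping supports,
    i.e. share at least one nonzero feature index."""
    support_i = set(X_sparse[i].indices)   # nonzero feature indices of row i
    support_j = set(X_sparse[j].indices)
    return not support_i.isdisjoint(support_j)

# Tiny example: rows 0 and 1 share feature 2; rows 0 and 2 do not overlap.
X = csr_matrix(np.array([[0., 0., 1., 0.],
                         [0., 1., 3., 0.],
                         [2., 0., 0., 5.]]))
print(conflicts(X, 0, 1))  # True
print(conflicts(X, 0, 2))  # False
```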
The connected-component decomposition gives us an estimate of the number of cores needed to perform an SGD mini-batch update with batch size $k$, and it gives us the set of indices that can be allocated to each core. The contribution of this report borrows ideas from the above algorithms, but ultimately uses a different approach to select non-conflicting indices in order to successfully parallelize stochastic sub-gradient descent in the context of SVMs. In particular, we note that HOGWILD! places no constraints on the indices that are picked, whereas CYCLADES imposes many constraints, so in theory a set of indices could contain many more connected components than there are cores available. Thus, the motivation behind our approach is that, given a fixed number of cores, we would like a way to return a set of indices that can be updated on different threads.

A. Induced Conflict Graph Generation

We consider four techniques for generating sets of indices that may or may not conflict. Suppose we wish to pick a set of unique indices $A$ with $|A| = k$. The conflict graph $G = (V, E)$ conceptually has $|V| = k$ nodes and some number of edges between nodes, where an edge between two nodes denotes a conflict between them (i.e., an overlap in their features). At a high level, each method below induces a conflict graph and enforces some invariant on the number of edges $|E|$ in this graph. It is important to note that the conflict graph is never explicitly generated; it is only used as a way to reason about various sets of indices and their effects on convergence rates. Each method is an iterative algorithm that starts with a random index and builds up an induced conflict graph one index at a time.

1) This approach enforces the invariant that every node has an edge to at least one other node. A simple way to enforce this is to add a new index to the set only if it conflicts with at least one of the previously generated indices. However, to reduce the cost of checking conflicts against all previous nodes, which requires $O(|V|)$ time, we only check conflicts against $O(\log |V|)$ nodes. The goal of this is twofold. First, it introduces more stochasticity into the set of indices, since we check only a small subset of the previous indices for conflicts and discard the candidate if it has no conflict with them. Second, it reduces the time needed to estimate the conflicts a particular index has with previously generated indices. Note that this method may discard more indices than necessary if they happen not to conflict with their ancestors. However, for large values of $k$, this method will, in expectation, find a set of valid indices faster than an approach that scans all previously generated nodes. To implement this, it is helpful to visualize the set of indices as a binary tree. We require that every index conflicts with at least one of its ancestors, since the number of ancestors is logarithmic in the total number of vertices. The iterative algorithm conceptually creates a binary tree in a level-order traversal and checks conflicts against all of the ancestors of a given node (a sketch of this procedure is given after this list).
2) This approach enforces the invariant that every node conflicts with every other node; in particular, we build a complete conflict graph.
3) This approach uses a similar binary tree-based construction, but enforces invariants on the number of non-edges in the graph, i.e., on the pairs of nodes that do not conflict. In this algorithm, we ensure that an index does not conflict with at least one of the ancestors in its binary-tree representation.
4) This approach enforces the invariant that no node conflicts with any other node; in particular, we build an empty conflict graph with no edges.
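As a rough illustration of Method 1, the sketch below grows a set of $k$ indices, accepting a candidate only if it conflicts with at least one of its ancestors in the implicit level-order binary tree over the already-accepted indices. It is a sketch under stated assumptions (a sparse data matrix and enough conflicting points to fill the set), not the exact implementation used in the experiments.

```python
import numpy as np

def conflicts(X_sparse, i, j):
    """Two data points conflict if their nonzero feature supports overlap."""
    return not set(X_sparse[i].indices).isdisjoint(X_sparse[j].indices)

def ancestors(pos):
    """Ancestor positions of node `pos` in an implicit binary tree stored
    in level order (root at position 0); there are O(log |V|) of them."""
    anc = []
    while pos > 0:
        pos = (pos - 1) // 2
        anc.append(pos)
    return anc

def sample_conflicting_indices(X_sparse, k, rng=None):
    """Method 1 sketch: grow a set of k indices, accepting a candidate
    only if it conflicts with at least one of its binary-tree ancestors.
    Assumes the data contain enough conflicting points to reach size k."""
    rng = np.random.default_rng() if rng is None else rng
    m = X_sparse.shape[0]
    chosen = [int(rng.integers(m))]           # root: a random starting index
    while len(chosen) < k:
        cand = int(rng.integers(m))
        if cand in chosen:
            continue
        pos = len(chosen)                     # level-order slot of the candidate
        if any(conflicts(X_sparse, cand, chosen[a]) for a in ancestors(pos)):
            chosen.append(cand)               # accept: conflicts with an ancestor
    return chosen
```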
Observe that Methods 1 and 3 above attempt to improve sampling time without negatively affecting the convergence rate of Parallel Pegasos.

IV. EXPERIMENTS

A. Dataset

We ran our evaluations and collected performance metrics on the CCAT data set, a binary classification task taken from the Reuters RCV1 collection.

Data Points = 804414
Training Size = 723973
Testing Size = 80441
Features = 47236
Sparsity = 0.16%

B. Plots

We note three statistics on a single run of a method (a sketch of how these statistics are computed is given at the end of this subsection). These statistics are:
- The loss function value on the testing set at each iteration (with $\lambda = 0$).
- The testing set percentage error rate at each iteration.
- The norm of the parameter vector at each iteration.

The different Parallel Pegasos methods vary based on the conflict graph generation. In particular, there are four Parallel Pegasos methods, corresponding directly to the four conflict-graph generation methods described in Section III. These three statistics were collected for two values of $k$ ($k = 2$ and $k = 10$), which represents the mini-batch size, i.e., the number of indices sampled for a conflict graph.
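The following sketch shows one way the three tracked statistics could be computed from a test set (illustrative only; the exact evaluation code is not reproduced here, and setting $\lambda = 0$ simply means the regularization term is omitted from the reported loss):

```python
import numpy as np

def evaluation_stats(w, X_test, y_test):
    """Per-iteration statistics: test hinge loss (lambda = 0),
    percentage error rate, and norm of the parameter vector."""
    margins = y_test * (X_test @ w)
    hinge_loss = np.maximum(0.0, 1.0 - margins).mean()   # average hinge loss
    error_rate = 100.0 * np.mean(margins <= 0)           # percent misclassified
    return hinge_loss, error_rate, np.linalg.norm(w)
```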

Fig. 1. Loss Function for k = 2.
Fig. 2. Norm of Parameter Vector for k = 2.
Fig. 3. Percent Error for k = 2.
Fig. 4. Loss Function for k = 10.
Fig. 5. Norm of Parameter Vector for k = 10.
Fig. 6. Percent Error for k = 10.

These runs were for 1000 iterations, with a value of $\lambda = 10^{-4}$ borrowed from [3]. However, these figures do not capture the timing of the various methods, which is reported in Table I. Note that these timings include the per-iteration computation of the statistics presented in this report.

TABLE I
ALGORITHM TIMINGS

Method        k = 2            k = 10
Mini-Batch    641.188916922    541.698472977
Parallel 1    437.151793003    439.413945913
Parallel 2    455.587590933    441.605366945
Parallel 3    486.384338856    453.869709969
Parallel 4    462.699862957    6521.923002
HOGWILD!      565.418184042    601.219654083

Note that the Parallel Pegasos algorithms that use a binary tree-based construction are much faster than the Mini-Batch and HOGWILD! approaches, without sacrificing much in the percentage error rate, especially for the larger value of k. In fact, conflict graph 1 performs extremely well and almost matches the performance of the best method. When running conflict graph 4 for large values of k, Parallel Pegasos takes an extremely long time, which is expected, since generating nodes without any conflicts requires many sampling iterations. This is another reason the binary tree-based conflict graph performs well: it does not require many sampling iterations to maintain a high convergence rate.

V. CONCLUSION AND FUTURE WORK

Various methods were presented for selecting indices with which to parallelize Pegasos. Theoretically, inducing conflicts in order to parallelize should not work; however, we observed that inducing certain kinds of conflicts improved the algorithm. Two important questions arise: why do such algorithms outperform algorithms with no conflicts, and can theoretical guarantees be given for them? We are also not certain why these results so closely match the performance of HOGWILD!.

In general, SVMs go hand-in-hand with kernel methods, which help classify data that is not linearly separable in the space in which it is provided. Introducing a kernel only changes the objective function by substituting the inner product (the kernel trick). However, parallelizing the resulting objective becomes very hard, since the dimension of the space to which the data points are lifted can be infinite (Gaussian kernels are an example of this).

REFERENCES

[1] X. Pan, M. Lam, S. Tu, D. Papailiopoulos, C. Zhang, M. I. Jordan, K. Ramchandran, C. Re, and B. Recht. CYCLADES: Conflict-free asynchronous machine learning. arXiv preprint arXiv:1605.09721, 2016.
[2] B. Recht, C. Re, S. Wright, and F. Niu. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693-701, 2011.
[3] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.