Model compression as constrained optimization, with application to neural nets

Model compression as constrained optimization, with application to neural nets

Miguel Á. Carreira-Perpiñán and Yerlan Idelbayev
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu

Why neural net compression?

Deep learning is very successful in practical applications such as computer vision, speech and NLP, achieving very good classification accuracy when large, deep neural nets are trained on large datasets (with several GPUs over several days...).

It is of interest to deploy high-accuracy deep nets on limited-computation devices (mobile phones, IoT, etc.), but this requires compressing the deep net to meet the device's limitations in memory, runtime, energy, bandwidth, etc.

Surprisingly high compression rates are often possible because deep nets are vastly over-parameterized in practice; this over-parameterization seems to make it easier to learn a high-accuracy net. Still, lossy compression will degrade the accuracy, so we want to compress optimally.

Related work on neural net compression

Many papers in the last 2-3 years, although neural net compression was already studied in the 1980s-90s. Some common approaches:
- Direct compression: train a reference net, then compress its weights using standard compression techniques (quantization with binarization or low precision, pruning weights/neurons, low-rank factorization, etc.). Simple and fast, but it ignores the loss function, so the accuracy deteriorates a lot at high compression rates.
- Embedding the compression mechanism in the backprop training, e.g. removing neurons/weights with low magnitude on the fly, or binarizing/rounding weight values in the backprop forward pass. Often no convergence guarantees.
- Other ad-hoc algorithms proposed for specific compression techniques.
Multiple compression techniques can be combined.

Desiderata for a good solution to the problem

First, we need a precise, general mathematical definition of the problem of model compression that is amenable to numerical optimization techniques. Intuitively, we want to achieve the lowest loss possible for a given compression technique and compression rate.

Then, we would like an algorithm that:
- is generic, applicable to many compression techniques, so the user can easily try different types of compression
- has optimality guarantees inasmuch as possible
- is easy to integrate in existing deep learning toolboxes
- is efficient in training.

An illustrative example: low-rank compression

Consider a linear regression problem with weight matrix $W \in \mathbb{R}^{d \times D}$:
$$\min_W \; L(W) = \sum_{n=1}^{N} \| y_n - W x_n \|^2$$
We want $W = U V^T$ with $U \in \mathbb{R}^{d \times r}$, $V \in \mathbb{R}^{D \times r}$ and $r < \min(d, D)$. Hence the problem is (RRR, reduced-rank regression):
$$\min_{U,V} \; L(U,V) = \sum_{n=1}^{N} \| y_n - U V^T x_n \|^2$$
This could be solved in various ways, more or less convenient:
- using gradient-based optimization
- using alternating optimization over U and V
- in fact, there is a closed-form solution as an eigenproblem.
But we want a more generally applicable algorithm. Instead, introduce an auxiliary variable $W = U V^T$ and define a constrained problem:
$$\min_{W,U,V} \; L(W) = \sum_{n=1}^{N} \| y_n - W x_n \|^2 \quad \text{s.t.} \quad W = U V^T \;\; \text{(and orthogonality constraints on } U \text{ or } V\text{)}$$
This separates the loss part from the compression part.
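
As a concrete illustration, here is a minimal numpy sketch (variable names and sizes are mine) that fits the unconstrained least-squares $W$ and then projects it onto rank-$r$ matrices by truncated SVD. This corresponds to the direct-compression point, not the optimal reduced-rank solution, which would also account for the input covariance.

```python
import numpy as np

# Toy data: N samples, inputs x in R^D, targets y in R^d (names are illustrative).
rng = np.random.default_rng(0)
N, D, d, r = 200, 20, 10, 3
X = rng.standard_normal((N, D))
W_true = rng.standard_normal((d, D))
Y = X @ W_true.T + 0.01 * rng.standard_normal((N, d))

# Unconstrained least squares: min_W sum_n ||y_n - W x_n||^2.
W_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
W_ls = W_ls.T                      # shape (d, D)

# "Direct compression": project W_ls onto rank-r matrices via truncated SVD,
# i.e. the closest W = U V^T in Frobenius norm (not the optimal RRR solution).
U_svd, s, Vt = np.linalg.svd(W_ls, full_matrices=False)
U = U_svd[:, :r] * s[:r]           # d x r
V = Vt[:r, :].T                    # D x r
W_lowrank = U @ V.T

print("rank of compressed W:", np.linalg.matrix_rank(W_lowrank))
print("loss, full-rank W:", np.sum((Y - X @ W_ls.T) ** 2))
print("loss, rank-r W:   ", np.sum((Y - X @ W_lowrank.T) ** 2))
```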

Model compression as constrained optimization

General formulation:
$$\min_{w,\Theta} \; L(w) \quad \text{s.t.} \quad w = \Delta(\Theta)$$
where $L(w)$ is the task loss on the uncompressed weights $w$ and $\Delta(\Theta)$ is the decompression mapping from the low-dimensional parameters $\Theta$.

[Figure: the $w$-space of uncompressed models, showing the reference net $\overline{w}$, the feasible set $\mathcal{C} = \{w \in \mathbb{R}^P: \; w = \Delta(\Theta) \text{ for some } \Theta \in \mathbb{R}^Q\}$ of models decompressible by $\Delta$, the direct-compression point $\Delta(\Theta^{DC})$ and the optimal compressed model $w^*$.]

Compression and decompression are usually seen as algorithms, but here we regard them as mathematical mappings in parameter space. The details of the compression technique are abstracted in $\Delta(\Theta)$.
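
To make the abstraction concrete, here is a minimal Python sketch (the interface and class names are mine, not the authors') of a compression technique as a pair of mappings: the decompression mapping $\Delta(\Theta)$ and the projection $\Pi(w)$ onto the feasible set, illustrated with weight pruning.

```python
import numpy as np

class CompressionScheme:
    """Illustrative interface: a compression technique is a pair of mappings in weight space."""

    def decompress(self, theta):
        """Delta(Theta): low-dimensional parameters -> uncompressed weights w in R^P."""
        raise NotImplementedError

    def project(self, w):
        """Pi(w): the Theta whose decompression is closest to w (projection onto C)."""
        raise NotImplementedError

class Pruning(CompressionScheme):
    """Keep only the kappa largest-magnitude weights; Theta = (indices, values)."""
    def __init__(self, kappa, P):
        self.kappa, self.P = kappa, P

    def decompress(self, theta):
        idx, vals = theta
        w = np.zeros(self.P)
        w[idx] = vals
        return w

    def project(self, w):
        idx = np.argsort(np.abs(w))[-self.kappa:]
        return (idx, w[idx])

# Tiny usage example: distortion between w and its closest decompressible model.
scheme = Pruning(kappa=5, P=20)
w = np.random.default_rng(0).standard_normal(20)
theta = scheme.project(w)
print(np.linalg.norm(w - scheme.decompress(theta)))   # ||w - Delta(Pi(w))||
```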

The learning-compression (LC) algorithm

Use a penalty method (e.g. the quadratic penalty): as $\mu \to \infty$, minimize
$$Q(w,\Theta;\mu) = L(w) + \frac{\mu}{2} \| w - \Delta(\Theta) \|^2$$
using alternating optimization over $w$ and $\Theta$:
- L step: $\min_w \; L(w) + \frac{\mu}{2} \| w - \Delta(\Theta) \|^2$, independent of the compression technique.
- C step: $\min_\Theta \; \| w - \Delta(\Theta) \|^2$, independent of the loss and dataset.
This algorithm converges to a local stationary point of the constrained problem as $\mu \to \infty$ under standard assumptions:
- smooth loss $L(w)$ and decompression mapping $\Delta(\Theta)$
- sufficiently accurate optimization of the L and C steps.
Generic: one algorithm, many compression techniques.
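
The alternation can be sketched end-to-end on a toy problem. The following minimal numpy sketch uses a quadratic loss, pruning as the compression, and an illustrative $\mu$ schedule and step sizes of my own choosing (not the paper's settings).

```python
import numpy as np

# Toy quadratic loss L(w) = ||A w - b||^2, with pruning as the compression
# (keep the kappa largest-magnitude weights).
rng = np.random.default_rng(0)
P, kappa = 50, 10
A = rng.standard_normal((200, P))
b = rng.standard_normal(200)

def grad_L(w):                      # gradient of the task loss
    return 2 * A.T @ (A @ w - b)

def project(w):                     # C step: closest kappa-sparse vector to w
    w_c = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-kappa:]
    w_c[idx] = w[idx]
    return w_c

w = np.linalg.lstsq(A, b, rcond=None)[0]            # reference (uncompressed) weights
w_c = project(w)                                    # direct compression as initialization

for mu in [1e-3 * (1.4 ** j) for j in range(40)]:   # increasing mu schedule
    for _ in range(100):                            # L step: min_w L(w) + mu/2 ||w - w_c||^2
        lr = min(1e-4, 1.0 / mu)                    # clip the step size at 1/mu
        w -= lr * (grad_L(w) + mu * (w - w_c))
    w_c = project(w)                                # C step: optimal compression of w

print("loss of compressed model:", np.sum((A @ project(w) - b) ** 2))
```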

LC algorithm: solving the L step

The L step always takes the same form regardless of the compression technique:
$$\min_w \; L(w) + \frac{\mu}{2} \| w - \Delta(\Theta) \|^2$$
This is the original loss on the uncompressed weights $w$, regularized with a quadratic term that makes the weights decay towards a validly compressed model $\Delta(\Theta)$. It can be optimized in the same way as the reference net: simply add $\mu (w - \Delta(\Theta))$ to the gradient. With large datasets we typically use SGD; we clip the learning rates so they never exceed $\frac{1}{\mu}$, to avoid oscillations as $\mu \to \infty$. The compression technique appears only in the C step.
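
In a PyTorch-style training loop, one L-step epoch might look like the hedged sketch below; the model, data loader, loss function and base learning rate are placeholders, and `w_compressed` holds $\Delta(\Theta)$ as a list of tensors matching `model.parameters()`.

```python
import torch

def l_step_epoch(model, loader, loss_fn, w_compressed, mu, base_lr):
    """One epoch of the L step: SGD on the task loss, with the quadratic-penalty
    gradient mu * (w - Delta(Theta)) added to each parameter's gradient."""
    lr = min(base_lr, 1.0 / mu)                      # clip the learning rate at 1/mu
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()              # gradient of the original loss
        with torch.no_grad():
            for p, p_c in zip(model.parameters(), w_compressed):
                p.grad.add_(mu * (p - p_c))          # penalty pulls w toward Delta(Theta)
        opt.step()
```

The outer loop would increase $\mu$ and alternate this L step with the C step, as in the earlier sketch.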

LC algorithm: solving the C step

Solving the C step means optimally compressing the current weights $w$:
$$\min_\Theta \; \| w - \Delta(\Theta) \|^2$$
Fast: it does not require the dataset (which appears only in $L(w)$). It is equivalent to the orthogonal projection $\Theta = \Pi(w)$ onto the feasible set: find the closest compressed model to $w$. The solution depends on the choice of decompression mapping and is known for many common compression techniques (see the sketches below):
- low rank: SVD
- binarization: binarize the weights
- quantization in general: k-means
- pruning: zero all but the top-magnitude weights
- etc.
So the C step simply requires calling the corresponding compression subroutine as a black box.
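
For illustration, here are hedged numpy sketches of a few of these projections, applied to a single weight matrix `W` (the function names and per-layer treatment are my choice):

```python
import numpy as np

def c_step_low_rank(W, r):
    """Closest rank-r matrix in Frobenius norm: truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def c_step_prune(W, kappa):
    """Closest matrix with at most kappa nonzeros: keep the largest magnitudes."""
    W_c = np.zeros_like(W)
    idx = np.unravel_index(np.argsort(np.abs(W), axis=None)[-kappa:], W.shape)
    W_c[idx] = W[idx]
    return W_c

def c_step_binarize(W):
    """Closest scaled-binary matrix a*sign(W): the scale a is the mean absolute weight."""
    return np.abs(W).mean() * np.sign(W)
```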

Example C step: quantization with adaptive codebook

The C step solution is derived from the form of the decompression mapping $\Delta\!: \Theta \to w \in \mathbb{R}^P$. In quantization, each real-valued weight $w_i$ must equal one of the $K$ values in a codebook $\mathcal{C} = \{c_1,\dots,c_K\} \subset \mathbb{R}$. Compressed net: codebook ($K$ floats) + $\log_2 K$ bits per weight.

The decompression mapping is a table lookup $w_i = c_{\kappa(i)}$, where $\kappa\!: \{1,\dots,P\} \to \{1,\dots,K\}$ is a discrete mapping that assigns each weight to one codebook entry. Rewriting $\kappa(\cdot)$ with binary assignment variables $Z \in \{0,1\}^{P \times K}$:
$$\min_{\mathcal{C},\kappa} \sum_{i=1}^{P} ( w_i - c_{\kappa(i)} )^2 \;\;\Longleftrightarrow\;\; \min_{\mathcal{C},Z} \sum_{i=1}^{P} \sum_{k=1}^{K} z_{ik} ( w_i - c_k )^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} z_{ik} = 1, \;\; i = 1,\dots,P$$
So $\Theta = \{\mathcal{C}, Z\}$, and the decompression mapping is $w_i = \sum_{k=1}^{K} z_{ik} c_k$. This is the squared-distortion problem, which is NP-complete; an approximate solution is given by k-means.
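
A minimal sketch of this quantization C step, assuming a plain Lloyd's-algorithm k-means over the scalar weights (initialization and iteration counts are illustrative):

```python
import numpy as np

def c_step_quantize(w, K, iters=50, seed=0):
    """k-means on the flattened weights w: returns codebook C and labels kappa(i)."""
    rng = np.random.default_rng(seed)
    codebook = rng.choice(w, size=K, replace=False)            # init with K weights
    for _ in range(iters):
        labels = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for k in range(K):
            if np.any(labels == k):
                codebook[k] = w[labels == k].mean()            # update centroids
    return codebook, labels

# Decompression is a table lookup: w_i = c_{kappa(i)}.
w = np.random.default_rng(1).standard_normal(10000)            # toy flattened net weights
codebook, labels = c_step_quantize(w, K=4)
w_quantized = codebook[labels]
print("squared distortion:", np.sum((w - w_quantized) ** 2))
```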

Experimental results: quantization

More compression for the same target loss than other algorithms. LeNet neural nets on the MNIST dataset: nearly no error degradation using K = 2 (1 bit/weight, compression ratio 30.5).

[Figure: error-compression tradeoff curves. Training loss L and test error E_test (%) vs. codebook size K = 2, ..., 64 (equivalently, compression ratio ρ from 30.5 down to 5.3), comparing DC, iDC and LC on LeNet300 and LeNet5.]

VGG-like net (14M parameters, 12 layers) on CIFAR10 using K = 2 (compression ratio 32). Error: reference 13.15%, compressed 13.05%.

Conclusion

A precise mathematical formulation of neural net compression as a constrained optimization problem, and a generic way to solve it, the learning-compression (LC) algorithm:
- The compression technique appears as a black-box subroutine in the C step, independent of the loss and dataset. We can try a different type of compression by simply calling its subroutine.
- The task loss and neural net training by SGD appear in the L step, independent of the compression technique, as if we were training the original, reference net with a weight decay.
- Guaranteed to converge to a local optimum under standard assumptions.
- Easy to implement in deep learning toolboxes. Very effective in practice.

We are developing this framework for a number of compression types. See the papers "Model compression as constrained optimization, with application to neural nets" on arXiv: 1707.01209, 1707.04319, more coming.

Partly supported by NSF award IIS-1423515 and an NVIDIA GPU donation.