Memory Bandwidth and Low Precision Computation CS6787 Lecture 10 Fall 2018

Memory as a Bottleneck. So far, we've just been talking about compute, e.g. techniques to decrease the amount of compute by decreasing the number of iterations. But machine learning systems need to process huge amounts of data, and we need to store, update, and transmit this data. As a result, memory is of critical importance: many applications are memory-bound.

Memory: The Simplified Picture. [Diagram: a compute unit connected directly to RAM.]

Memory: The Multicore Picture. [Diagram: multiple compute cores, each with its own L1 and L2 cache, sharing an L3 cache that connects to RAM.]

Memory: The Multisocket Picture. [Diagram: two sockets, each containing multiple compute cores with private L1/L2 caches and a shared L3 cache, and each socket attached to its own RAM.]

Memory: The Distributed Picture. [Diagram: multiple GPUs connected over a network.]

What can we learn from these pictures? There are many more memory boxes than compute boxes, and even more as we zoom out. Memory has a hierarchical structure, so locality matters: some memory is closer and easier to access than other memory. We also have the standard concerns about CPU cache locality.

What limits us? Memory capacity: how much data can we store locally in RAM and/or in cache? Memory bandwidth: how much data can we load from some source in a fixed amount of time? Memory locality: roughly, how often is the data that we need stored nearby? Power: how much energy is required to operate all of this memory?

One way to help: Low-Precision Computation

Low-Precision Computation. Traditional ML systems use 32-bit or 64-bit floating point numbers. But do we actually need this much precision, especially when our inputs come from noisy measurements? Idea: instead use 8-bit or 16-bit numbers to compute. These can be either floating point or fixed point, and on an FPGA or ASIC we can use arbitrary bit-widths.

Low Precision and Memory. Major benefit of low-precision: it uses less memory bandwidth. Precision in DRAM vs. memory throughput (assuming ~40 GB/sec memory bandwidth): 64-bit float vector, 5 numbers/ns; 32-bit float vector, 10 numbers/ns; 16-bit int vector, 20 numbers/ns; 8-bit int vector, 40 numbers/ns.

Low Precision and Memory. Major benefit of low-precision: it takes up less space. Precision in DRAM vs. cache capacity (assuming a ~32 MB cache): 64-bit float vector, 4 M numbers; 32-bit float vector, 8 M numbers; 16-bit int vector, 16 M numbers; 8-bit int vector, 32 M numbers.
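Both tables follow from simple arithmetic; here is a minimal sketch, taking the ~40 GB/sec bandwidth and ~32 MB cache figures above as the assumptions:

```python
# Back-of-the-envelope numbers behind the two tables above.
bandwidth_bytes_per_ns = 40        # ~40 GB/sec  ==  40 bytes/ns
cache_bytes = 32 * 2**20           # ~32 MB cache

for name, size in [("64-bit float", 8), ("32-bit float", 4),
                   ("16-bit int", 2), ("8-bit int", 1)]:
    throughput = bandwidth_bytes_per_ns / size     # numbers per ns
    capacity = cache_bytes / size / 2**20          # millions of numbers
    print(f"{name}: {throughput:.0f} numbers/ns, {capacity:.0f} M numbers in cache")
```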

Low Precision and Parallelism. Another benefit of low-precision: we can use SIMD instructions to get more parallelism on CPU. SIMD precision vs. SIMD parallelism: 64-bit float vector, 4 multiplies/cycle (vmulpd instruction); 32-bit float vector, 8 multiplies/cycle (vmulps instruction); 16-bit int vector, 16 multiplies/cycle (vpmaddwd instruction); 8-bit int vector, 32 multiplies/cycle (vpmaddubsw instruction).

Low Precision and Power. Low-precision computation can even have a super-linear effect on energy: an int16 multiplier uses far less energy than a float32 multiplier. Memory energy can also have a quadratic dependence on precision. [Diagram: a float32 multiplier compared with an int16 multiplier; memory energy vs. algorithm runtime for float32 memory.]

Effects of Low-Precision Computation. Pros: fit more numbers (and therefore more training examples) in memory; store more numbers (and therefore larger models) in the cache; transmit more numbers per second; compute faster by extracting more parallelism; use less energy. Cons: limits the numbers we can represent; introduces quantization error when we store a full-precision number in a low-precision representation.

Ways to represent low-precision numbers

FP16/Half-precision floating point. 16-bit floating point numbers: a 1-bit sign, a 5-bit exponent, and a 10-bit significand. Usually, the represented value is $x = (-1)^{\text{sign bit}} \cdot 2^{\text{exponent} - 15} \cdot (1.\text{significand})_2$.
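As an illustration, here is a minimal sketch of decoding a normal FP16 bit pattern according to this formula (denormals, infinities, and NaNs are not handled):

```python
def decode_fp16(bits):
    """Decode a 16-bit pattern as a half-precision float.
    Handles only normal numbers (exponent field not 0 or 31)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F       # 5-bit biased exponent
    significand = bits & 0x3FF           # 10-bit fraction
    assert 0 < exponent < 31, "denormals, infinities, and NaNs not handled here"
    return (-1) ** sign * 2.0 ** (exponent - 15) * (1 + significand / 2**10)

print(decode_fp16(0x3C00))  # 1.0
print(decode_fp16(0xC000))  # -2.0
```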

Arithmetic on half-precision floats. Complicated: it has to handle adding numbers with different exponents and signs, and to be efficient it needs to be supported in hardware. Inexact: operations can experience overflow/underflow just like with more common floating point formats, but it happens more often. Can represent a wide range of numbers, because of the exponential scaling.

Half-precision floating point support. Supported on some modern GPUs, including a new efficient implementation on NVIDIA Pascal GPUs. Good empirical results for deep learning.

Fixed point numbers. A (p + q + 1)-bit fixed point number has a 1-bit sign, a p-bit integer part, and a q-bit fractional part. The represented number is $x = (-1)^{\text{sign bit}} \cdot (\text{integer part} + 2^{-q} \cdot \text{fractional part}) = 2^{-q} \cdot (\text{the whole thing as a signed integer})$.
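A minimal sketch of decoding such a number, assuming the bit pattern is interpreted as a two's-complement signed integer:

```python
def decode_fixed_point(bits, p, q):
    """Decode a (p + q + 1)-bit fixed-point number: interpret the whole bit
    pattern as a signed integer k (two's complement assumed here) and scale
    by 2**-q, matching x = 2^-q * (the whole thing as a signed integer)."""
    width = p + q + 1
    k = bits & ((1 << width) - 1)
    if k >= 1 << (width - 1):           # sign bit set -> negative value
        k -= 1 << width
    return k * 2.0 ** (-q)

# 8-bit number with p = 0, q = 7: the all-fractional format on the next slide
print(decode_fixed_point(0b01000000, p=0, q=7))   # 64/128 = 0.5
print(decode_fixed_point(0b11111111, p=0, q=7))   # -1/128
```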

Example: 8-bit fixed point number. It's common to want to represent numbers between -1 and 1. To do this, we can use a fixed point number with all fractional bits: a 1-bit sign and a 7-bit fractional part. If the number as an integer is k, then the represented number is $x = 2^{-7} k \in \{-1, -\tfrac{127}{128}, \ldots, -\tfrac{1}{128}, 0, \tfrac{1}{128}, \ldots, \tfrac{126}{128}, \tfrac{127}{128}\}$.

More generally: scaled fixed point numbers. Sometimes we don't want the decimal point to lie between two bits that we are actually storing; we might want tighter control over what our bits mean. Idea: pick a real-number scale factor s, then let the integer k represent $x = s \cdot k$. This is a generalization of traditional fixed point, where $s = 2^{-\#\text{ of fractional bits}}$.
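A minimal sketch of quantizing to and dequantizing from a scaled fixed-point representation; the particular scale factor and the 8-bit width are illustrative assumptions:

```python
import numpy as np

def quantize(x, s, bits=8):
    """Round x to the nearest representable value k * s, clipping k to the
    signed range of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return np.clip(np.round(x / s), lo, hi).astype(np.int8)

def dequantize(k, s):
    return k.astype(np.float32) * s

s = 2.0 ** -7                      # s = 2^-(# fractional bits) recovers plain fixed point
x = np.array([0.3, -0.99, 0.51])
k = quantize(x, s)
print(k, dequantize(k, s))         # approximately [0.297, -0.992, 0.508]
```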

Arithmetic on fixed point numbers. Simple: we can just use preexisting integer processing units. Mostly exact: underflow is impossible, and overflow can happen but is easy to understand; we can always convert to a higher-precision representation to avoid overflow. Can represent a much narrower range of numbers than float.

Example: Exact Fixed Point Multiply. When we multiply two integers, if we want the result to be exact, we need to convert to a representation with more bits. For example, if we take the product of two 8-bit numbers, the result should be a 16-bit number to be exact. Why? 100 x 100 = 10000, which can't be stored as an 8-bit number. To have an exact fixed point multiply, we can do the same thing, since fixed-point operations are just integer operations behind the scenes.
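A minimal NumPy sketch of the widening trick: cast the 8-bit operands to 16 bits before multiplying so the product is exact.

```python
import numpy as np

a = np.array([100], dtype=np.int8)
b = np.array([100], dtype=np.int8)

# Multiplying in 8 bits wraps around modulo 256:
print(a * b)                                     # [16], not 10000

# Widening to 16 bits first keeps the product exact:
print(a.astype(np.int16) * b.astype(np.int16))   # [10000]

# For fixed point with q fractional bits, the product has 2q fractional
# bits, so the result should be reinterpreted (or rescaled) accordingly.
```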

Support for fixed-point arithmetic. Anywhere integer arithmetic is supported: CPUs and GPUs, although not all GPUs support 8-bit integer arithmetic, and AVX2 does not have all the 8-bit arithmetic instructions we'd like. Particularly effective on FPGAs and ASICs, where floating point units are costly. Sadly, there is very little support for other precisions; 4-bit operations would be particularly useful.

Custom Quantization Points. Even more generally, we can just have a list of 2^b numbers and say that these are the numbers a particular low-precision bit string represents. We can think of the bit string as indexing a number in a dictionary. This gives us total freedom as to range and scaling, but computation can be tricky. There has been some recent research into using this with hardware support: The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning (Zhang et al., 2017).
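A minimal sketch of quantizing against a custom codebook of 2^b values; the codebook here is an arbitrary illustrative choice, not one from the ZipML paper:

```python
import numpy as np

# An arbitrary 2-bit (b = 2) codebook: any 2^b real numbers will do.
codebook = np.array([-1.0, -0.1, 0.1, 1.0], dtype=np.float32)

def quantize_to_codebook(x):
    """Return, for each entry of x, the index of the nearest codebook value."""
    return np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1).astype(np.uint8)

x = np.array([0.03, -0.7, 0.8], dtype=np.float32)
idx = quantize_to_codebook(x)
print(idx)              # indices into the codebook: [2 0 3]
print(codebook[idx])    # the dequantized values: [ 0.1 -1.   1. ]
```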

Recap of low-precision representations. Half-precision floating-point: complicated arithmetic, but good with hardware support; difficult to reason about overflow and underflow; better range; no 8-bit support as of yet. Fixed-point: simple arithmetic, supported wherever integers are; easy to reason about overflow, but worse range; supports 8-bit and 16-bit arithmetic, but limited 4-bit support.

Low-Precision SGD

Recall: the SGD update rule $w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_t, y_t)$. There are a lot of numbers we can make low-precision here: we can quantize the input dataset (x, y); we can quantize the model w; we can try to quantize within the gradient computation itself; and we can try to quantize the communication among the parallel workers.

Several Broad Classes of Numbers. For SGD on deep-learning-like objectives: dataset numbers, used to store the immutable input data; model/weight numbers, used to represent the vector we are updating; numbers used to store gradients as we are computing them, which split into activation numbers (used to compute/store the forward pass/loss of a neural network), error numbers (used to compute the gradients in the backwards pass), and gradient numbers (used to store the gradients as they are computed); and communication numbers, used to communicate among parallel workers. (De Sa et al. 2017; Wu et al. 2018)

Several Broad Classes of Numbers. [Diagram: input data (D) flows forward through linear and non-linear layers to produce activations (A); errors (E) flow backward through the layers; weights (M) combine with activations and errors to produce gradients (G); communication (C) links to other worker machines.]

Quantize classes independently. Using low precision for different number classes has different effects on throughput: quantizing the dataset numbers improves memory capacity and overall training example throughput; quantizing the model/error numbers improves cache capacity and saves on compute; quantizing the gradient/activation numbers saves compute; and quantizing the communication numbers saves on expensive inter-worker bandwidth.

Quantize classes independently. Using low precision for different number classes also has different effects on statistical efficiency and accuracy: quantizing the dataset numbers means you're solving a different problem; quantizing the model numbers adds noise to each gradient step, and often means you can't exactly represent the solution; quantizing the gradient/activation/error numbers can add quantization error to each gradient step; and quantizing the communication numbers can add errors that cause workers' local models to diverge, which can slow down convergence.

Theoretical Guarantees for Low Precision. Reducing precision adds noise in the form of round-off error. There are two approaches to rounding: biased rounding, which rounds to the nearest representable number, and unbiased rounding, which rounds randomly so that $\mathbb{E}[Q(x)] = x$ (for example, 2.7 is rounded to 3.0 with probability 70% and to 2.0 with probability 30%). Using this, we can prove guarantees that SGD works with a low-precision model (Taming the Wild [NIPS 2015]). In some of my work, we proved that we can combine low-precision computation with asynchronous execution, which we call BUCKWILD!
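A minimal sketch of unbiased (stochastic) rounding to the nearest integers; averaging many samples confirms that E[Q(x)] = x up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def unbiased_round(x):
    """Round x down with probability (1 - frac(x)) and up with probability
    frac(x), so that the result equals x in expectation."""
    floor = np.floor(x)
    return floor + (rng.random(np.shape(x)) < (x - floor))

samples = unbiased_round(np.full(100000, 2.7))
print(samples[:10])          # a mix of 2.0s and 3.0s (about 30% / 70%)
print(samples.mean())        # close to 2.7
```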

Why unbiased rounding? Imagine running SGD with a low-precision model, using the update rule $w_{t+1} = Q(w_t - \alpha_t \nabla f(w_t; x_t, y_t))$, where Q is an unbiased quantization function. In expectation, this is just gradient descent: $\mathbb{E}[w_{t+1} \mid w_t] = \mathbb{E}[Q(w_t - \alpha_t \nabla f(w_t; x_t, y_t)) \mid w_t] = \mathbb{E}[w_t - \alpha_t \nabla f(w_t; x_t, y_t) \mid w_t] = w_t - \alpha_t \nabla f(w_t)$.

Doing unbiased rounding efficiently. We still need an efficient way to do unbiased rounding, and pseudorandom number generation can be expensive: e.g., calling C++ rand or using the Mersenne twister takes many clock cycles. Empirically, we can use very cheap pseudorandom number generators and still get good statistical results. For example, we can use XORSHIFT, which is just a cyclic permutation of the state.
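A minimal sketch of a xorshift-style generator driving the rounding decision; this illustrates the idea and is not the exact generator used in BUCKWILD!:

```python
import math

def xorshift32(state):
    """One step of a 32-bit xorshift generator: cheap random bits that are
    empirically good enough for stochastic rounding (state must be nonzero)."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state

def cheap_unbiased_round(x, state):
    """Stochastically round x to an integer using one xorshift draw."""
    state = xorshift32(state)
    u = state / 2**32                   # roughly uniform in [0, 1)
    lo = math.floor(x)
    return (lo + 1 if u < x - lo else lo), state

state = 2463534242                      # any nonzero 32-bit seed
value, state = cheap_unbiased_round(2.7, state)
print(value)                            # 2 or 3, with P(3) about 0.7
```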

Benefits of Low-Precision Computation. [Figure from https://devblogs.nvidia.com/parallelforall/mixed-precision-programming-cuda-8/]

Memory Locality and Scan Order

Memory Locality: Two Kinds. Memory locality is needed for good cache performance. Temporal locality: how frequently the same data is reused within a short time window. Spatial locality: how frequently we use data that is near data that has recently been used. Where is there locality in stochastic gradient descent?

Problem: no dataset locality across iterations. The training example at each iteration is chosen randomly, which is called a random scan order, so it is impossible for the cache to predict what data will be needed in the update $w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_t, y_t)$. Idea: process examples in the order in which they are stored in memory, called a systematic scan order or sequential scan order. Does this improve the memory locality?

Random scan order vs. sequential scan order. It is much easier to prove theoretical results for random scan, but sequential scan has better systems performance. In practice, almost everyone uses sequential scan, and there's no empirical evidence that it's statistically worse in most cases, even though we can construct cases where using sequential scan does harm the convergence rate.

Other scan orders. Shuffle-once, then sequential scan: shuffle the data once, then scan it sequentially for the rest of execution; statistically very similar to random scan, at least at the start. Random reshuffling: randomly shuffle on every pass through the data; believed to be always at least as good as both random scan and sequential scan, but there is no proof that it is better. A sketch of these scan orders is given below.
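A minimal sketch of the four scan orders as index generators over a dataset of n examples (the function name and interface are illustrative):

```python
import numpy as np

def scan_order(n, num_epochs, mode, seed=0):
    """Yield example indices under one of the scan orders discussed above."""
    rng = np.random.default_rng(seed)
    order = np.arange(n)                         # storage order: 0..n-1
    if mode == "shuffle_once":
        rng.shuffle(order)                       # shuffle once, reuse every pass
    for _ in range(num_epochs):
        if mode == "random_scan":                # sample with replacement
            yield from rng.integers(0, n, size=n)
        else:
            if mode == "random_reshuffling":     # reshuffle on every pass
                rng.shuffle(order)
            yield from order                     # sequential / shuffle_once

for mode in ["sequential", "shuffle_once", "random_reshuffling", "random_scan"]:
    print(mode, list(scan_order(6, 2, mode)))
```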

Demo

Questions? Upcoming things: Paper Review #7a or #7b is due today; Paper Presentations #8a and #8b are on Wednesday.