
Memory Bandwidth and Low Precision Computation CS6787 Lecture 9 Fall 2017

Memory as a Bottleneck

So far, we've just been talking about compute, e.g. techniques to decrease the amount of compute by decreasing the number of iterations. But machine learning systems need to process huge amounts of data, and need to store, update, and transmit this data. As a result, memory is of critical importance: many applications are memory-bound.

Memory: The Simplified Picture

[Diagram: a compute unit connected directly to RAM]

Memory: The Multicore Picture

[Diagram: several compute cores, each with its own L1 and L2 cache, sharing an L3 cache that connects to RAM]

Memory: The Multisocket Picture

[Diagram: two sockets, each containing several compute cores with private L1/L2 caches, a shared L3 cache, and its own RAM]

Memory: The Distributed Picture

[Diagram: multiple GPUs connected to each other over a network]

What can we learn from these pictures?

There are many more memory boxes than compute boxes, and even more as we zoom out. Memory has a hierarchical structure, and locality matters: some memory is closer and easier to access than other memory. We also still have the standard concerns for CPU locality.

What limits us?

- Memory capacity: how much data can we store locally in RAM and/or in cache?
- Memory bandwidth: how much data can we load from some source in a fixed amount of time?
- Memory locality: roughly, how often is the data that we need stored nearby?
- Power: how much energy is required to operate all of this memory?

One way to help: Low-Precision Computation

Low-Precision Computation

Traditional ML systems use 32-bit or 64-bit floating-point numbers. But do we actually need this much precision? Especially when we have inputs that come from noisy measurements. Idea: instead use 8-bit or 16-bit numbers to compute. These can be either floating point or fixed point, and on an FPGA or ASIC we can use arbitrary bit-widths.

Low Precision and Memory

Major benefit of low-precision: uses less memory bandwidth.

    Precision in DRAM        Memory Throughput
    64-bit float vector      5 numbers/ns
    32-bit float vector      10 numbers/ns
    16-bit int vector        20 numbers/ns
    8-bit int vector         40 numbers/ns

(assuming ~40 GB/sec memory bandwidth)
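As a quick sanity check, the throughput column is just the assumed bandwidth divided by the width of each number. A small Python sketch using the slide's ~40 GB/s figure:

```python
# Throughput in numbers/ns at the assumed ~40 GB/s DRAM bandwidth.
# 40 GB/s is 40 bytes per nanosecond.
bandwidth_bytes_per_ns = 40

for bits in (64, 32, 16, 8):
    numbers_per_ns = bandwidth_bytes_per_ns / (bits // 8)
    print(f"{bits}-bit: {numbers_per_ns:g} numbers/ns")
# prints 5, 10, 20, and 40 numbers/ns, matching the table
```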

Low Precision and Memory

Major benefit of low-precision: takes up less space.

    Precision in DRAM        Cache Capacity
    64-bit float vector      4 M numbers
    32-bit float vector      8 M numbers
    16-bit int vector        16 M numbers
    8-bit int vector         32 M numbers

(assuming a ~32 MB cache)

Low Precision and Parallelism

Another benefit of low-precision: we can use SIMD instructions to get more parallelism on CPU.

    SIMD Precision           SIMD Parallelism
    64-bit float vector      4 multiplies/cycle (vmulpd instruction)
    32-bit float vector      8 multiplies/cycle (vmulps instruction)
    16-bit int vector        16 multiplies/cycle (vpmaddwd instruction)
    8-bit int vector         32 multiplies/cycle (vpmaddubsw instruction)

Low Precision and Power

Low-precision computation can even have a super-linear effect on energy: an int16 multiplier circuit is much smaller than a float32 multiplier. [Diagram: a large float32 multiplier next to a much smaller int16 multiplier.] Memory energy can also have a quadratic dependence on precision, since precision affects both the memory energy per step and the algorithm runtime. [Diagram: float32 memory energy scaled by algorithm runtime.]

Effects of Low-Precision Computation

Pros:
- Fit more numbers (and therefore more training examples) in memory
- Store more numbers (and therefore larger models) in the cache
- Transmit more numbers per second
- Compute faster by extracting more parallelism
- Use less energy

Cons:
- Limits the numbers we can represent
- Introduces quantization error when we store a full-precision number in a low-precision representation

Ways to represent low-precision numbers

FP16/Half-precision floating point

16-bit floating point numbers: a 1-bit sign, a 5-bit exponent, and a 10-bit significand. Usually, the represented value is

$x = (-1)^{\text{sign bit}} \cdot 2^{\text{exponent} - 15} \cdot (1.\text{significand})_2$
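To make the format concrete, here is a minimal Python sketch that decodes a 16-bit pattern using the formula above; the handling of subnormals and special values follows the IEEE 754 half-precision rules, which the slide does not spell out:

```python
def decode_fp16(bits):
    """Decode a 16-bit IEEE half-precision pattern into a Python float."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F   # 5-bit exponent
    significand = bits & 0x3FF       # 10-bit significand
    if exponent == 0x1F:             # all-ones exponent: infinity or NaN
        value = float('inf') if significand == 0 else float('nan')
    elif exponent == 0:              # subnormal: no implicit leading 1
        value = (significand / 2**10) * 2**(1 - 15)
    else:                            # normal: (1.significand)_2 * 2^(exponent - 15)
        value = (1 + significand / 2**10) * 2**(exponent - 15)
    return -value if sign else value

assert decode_fp16(0x3C00) == 1.0    # sign 0, exponent 15, significand 0
assert decode_fp16(0xC000) == -2.0   # sign 1, exponent 16, significand 0
```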

Arithmetic on half-precision floats

Complicated: it has to handle adding numbers with different exponents and signs, so to be efficient it needs to be supported in hardware. Inexact: operations can experience overflow/underflow just like with more common floating-point numbers, but it happens more often. Can represent a wide range of numbers, because of the exponential scaling.

Half-precision floating point support

Supported on some modern GPUs, including a new efficient implementation on NVIDIA Pascal GPUs. Good empirical results for deep learning.

Fixed point numbers

A (p + q + 1)-bit fixed point number has a 1-bit sign, a p-bit integer part, and a q-bit fractional part. The represented number is

$x = (-1)^{\text{sign bit}} \cdot \left(\text{integer part} + 2^{-q} \cdot \text{fractional part}\right) = 2^{-q} \cdot (\text{the whole bit string as a signed integer})$

Example: 8-bit fixed point number

It's common to want to represent numbers between -1 and 1. To do this, we can use a fixed point number with all fractional bits: a 1-bit sign and a 7-bit fractional part. If the number as an integer is k, then the represented number is

$x = 2^{-7} k \in \left\{ -1, -\tfrac{127}{128}, \ldots, -\tfrac{1}{128}, 0, \tfrac{1}{128}, \ldots, \tfrac{126}{128}, \tfrac{127}{128} \right\}$

More generally: scaled fixed point numbers

Sometimes we don't want the decimal point to lie between two bits that we are actually storing; we might want tighter control over what our bits mean. Idea: pick a real-number scale factor $s$, then let the integer $k$ represent $x = s \cdot k$. This is a generalization of traditional fixed point, where $s = 2^{-\#\text{ of fractional bits}}$.
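A minimal sketch of scaled fixed point in Python/NumPy (the function names are mine, not from the lecture); note that classic fixed point with q fractional bits is recovered by choosing s = 2**-q:

```python
import numpy as np

def quantize(x, s):
    """Round each real value to the nearest representable s*k, k a signed 8-bit integer."""
    return np.clip(np.rint(x / s), -128, 127).astype(np.int8)

def dequantize(k, s):
    """Recover the real value x = s * k represented by the integer k."""
    return s * k.astype(np.float32)

# The 8-bit all-fractional-bits example above is the special case s = 2**-7.
s = 2.0 ** -7
x = np.array([0.5, -0.123, 0.99], dtype=np.float32)
print(dequantize(quantize(x, s), s))  # values rounded to multiples of 1/128
```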

Arithmetic on fixed point numbers

Simple: we can just use preexisting integer processing units. Mostly exact: underflow is impossible, and while overflow can happen, it is easy to understand, and we can always convert to a higher-precision representation to avoid it. However, fixed point can represent a much narrower range of numbers than float.

Example: Exact Fixed Point Multiply

When we multiply two integers, if we want the result to be exact, we need to convert to a representation with more bits. For example, if we take the product of two 8-bit numbers, the result should be a 16-bit number to be exact. Why? 100 x 100 = 10000, which can't be stored as an 8-bit number. To have an exact fixed point multiply, we can do the same thing, since fixed-point operations are just integer operations behind the scenes.
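In NumPy this widening is explicit: an int8 product wraps around unless we convert to 16 bits first. A small sketch of the 100 x 100 example:

```python
import numpy as np

a = np.array([100], dtype=np.int8)
b = np.array([100], dtype=np.int8)

wrong = a * b                                    # stays int8: 10000 wraps to 16 (mod 256)
exact = a.astype(np.int16) * b.astype(np.int16)  # widen first: exact 16-bit result

print(wrong[0], exact[0])  # 16 10000
```

Note also that the product of two fixed-point numbers with q fractional bits each has 2q fractional bits, so the scale factor has to be tracked alongside the widened integer.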

Support for fixed-point arithmetic

Available anywhere integer arithmetic is supported: CPUs and GPUs, although not all GPUs support 8-bit integer arithmetic, and AVX2 does not have all the 8-bit arithmetic instructions we'd like. Particularly effective on FPGAs and ASICs, where floating point units are costly. Sadly, there is very little support for other precisions; 4-bit operations would be particularly useful.

Custom Quantization Points

Even more generally, we can just have a list of $2^b$ numbers and say that these are the numbers a particular low-precision bit string represents. We can think of the bit string as indexing a number in a dictionary. This gives us total freedom as to range and scaling, but computation can be tricky. There is some recent research into using this with hardware support: "The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning" (Zhang et al., 2017).
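A minimal sketch of the dictionary idea (the codebook values here are arbitrary, chosen only for illustration): each b-bit string is an index into a table of $2^b$ real numbers.

```python
import numpy as np

# A hypothetical 2-bit codebook: four arbitrary quantization points.
codebook = np.array([-1.0, -0.1, 0.1, 1.0], dtype=np.float32)

def encode(x):
    """Map each value to the index of its nearest codebook entry."""
    return np.abs(x[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)

def decode(indices):
    """Look the stored indices back up in the dictionary."""
    return codebook[indices]

x = np.array([0.07, -0.8, 0.4], dtype=np.float32)
print(decode(encode(x)))  # [ 0.1 -1.   0.1]
```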

Recap of low-precision representations

Half-precision floating point: complicated arithmetic, but good with hardware support; difficult to reason about overflow and underflow; better range; no 8-bit support as of yet.

Fixed point: simple arithmetic, supported wherever integers are; easy to reason about overflow, but worse range; supports 8-bit and 16-bit arithmetic, but little to no 4-bit support.

Low-Precision SGD

Recall: SGD update rule

$w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_t, y_t)$

There are a lot of numbers we can make low-precision here:
- We can quantize the input dataset $x, y$
- We can quantize the model $w$
- We can try to quantize within the gradient computation itself
- We can try to quantize the communication among the parallel workers

Four Broad Classes of Numbers

- Dataset numbers: used to store the immutable input data
- Model numbers: used to represent the vector we are updating
- Gradient numbers: used as intermediates in gradient computations
- Communication numbers: used to communicate among parallel workers

Quantize classes independently

Using low precision for different number classes has different effects on throughput:
- Quantizing the dataset numbers improves memory capacity and overall training example throughput
- Quantizing the model numbers improves capacity and saves on compute
- Quantizing the gradient numbers saves compute
- Quantizing the communication numbers saves on expensive inter-worker memory bandwidth

Quantize classes independently

Using low precision for different number classes has different effects on statistical efficiency and accuracy:
- Quantizing the dataset numbers means you're solving a different problem
- Quantizing the model numbers adds noise to each gradient step, and often means you can't exactly represent the solution
- Quantizing the gradient numbers can add errors to each gradient step
- Quantizing the communication numbers can add errors which cause workers' local models to diverge, which slows down convergence

Theoretical Guarantees for Low Precision

Reducing precision adds noise in the form of round-off error. There are two approaches to rounding: biased rounding, which rounds to the nearest representable number, and unbiased rounding, which rounds randomly in such a way that $\mathbb{E}[Q(x)] = x$ (for example, 2.7 rounds up to 3.0 with probability 70% and down to 2.0 with probability 30%). Using this, we can prove guarantees that SGD works with a low-precision model (Taming the Wild [NIPS 2015]). I also proved we can combine low-precision computation with asynchronous execution, which we call BUCKWILD!
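A small simulation of unbiased (stochastic) rounding, rounding to the integer grid for simplicity; it checks empirically that $\mathbb{E}[Q(x)] = x$ even though every individual output is quantized:

```python
import numpy as np

rng = np.random.default_rng(0)

def unbiased_round(x, rng):
    """Round down or up at random so that E[Q(x)] = x.

    For example, 2.7 rounds to 3.0 with probability 0.7
    and to 2.0 with probability 0.3.
    """
    low = np.floor(x)
    return low + (rng.random(np.shape(x)) < (x - low))

samples = unbiased_round(np.full(100_000, 2.7), rng)
print(samples.mean())  # ~2.7, even though every sample is exactly 2.0 or 3.0
```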

Why unbiased rounding?

Imagine running SGD with a low-precision model, with update rule

$w_{t+1} = Q\left(w_t - \alpha_t \nabla f(w_t; x_t, y_t)\right)$

Here, $Q$ is an unbiased quantization function. In expectation, this is just gradient descent:

$\mathbb{E}[w_{t+1} \mid w_t] = \mathbb{E}\left[Q\left(w_t - \alpha_t \nabla f(w_t; x_t, y_t)\right) \mid w_t\right] = \mathbb{E}\left[w_t - \alpha_t \nabla f(w_t; x_t, y_t) \mid w_t\right] = w_t - \alpha_t \nabla f(w_t)$

Doing unbiased rounding efficiently

We still need an efficient way to do unbiased rounding, and pseudorandom number generation can be expensive: e.g., calling C++ rand or using the Mersenne twister takes many clock cycles. Empirically, we can use very cheap pseudorandom number generators and still get good statistical results. For example, we can use XORSHIFT, which is just a cyclic permutation.
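For illustration, here is Marsaglia's 32-bit XORSHIFT generator in Python; each output is just three shift-and-XOR steps, far cheaper than the Mersenne twister (though statistically much weaker):

```python
def xorshift32(state):
    """One step of a 32-bit XORSHIFT generator: three shifts and XORs."""
    state ^= (state << 13) & 0xFFFFFFFF
    state ^= state >> 17
    state ^= (state << 5) & 0xFFFFFFFF
    return state

state = 2463534242  # any nonzero 32-bit seed works
for _ in range(3):
    state = xorshift32(state)
    print(state / 2 ** 32)  # a cheap pseudorandom value in [0, 1)
```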

Memory Locality and Scan Order

Memory Locality: Two Kinds

Memory locality is needed for good performance. Temporal locality: frequency of reuse of the same data within a short time window. Spatial locality: frequency of use of data nearby data that has recently been used. Where is there locality in stochastic gradient descent?

Problem: no dataset locality across iterations

The training example at each iteration is chosen randomly; this is called a random scan order. It is impossible for the cache to predict what data will be needed.

$w_{t+1} = w_t - \alpha_t \nabla f(w_t; x_t, y_t)$

Idea: process examples in the order in which they are stored in memory. This is called a systematic scan order or sequential scan order. Does this improve the memory locality?

Random scan order vs. sequential scan order

It is much easier to prove theoretical results for random scan, but sequential scan has better systems performance. In practice, almost everyone uses sequential scan, and there's no empirical evidence that it's statistically worse in most cases, even though we can construct cases where using sequential scan does harm the convergence rate.

Other scan orders

Shuffle-once, then sequential scan: shuffle the data once, then scan systematically for the rest of execution; statistically very similar to random scan at the start. Random reshuffling: randomly shuffle on every pass through the data; believed to be always at least as good as both random scan and sequential scan, but there is no proof that it is better. (See the sketch of these scan orders below.)
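A sketch of the scan orders as index generators for a dataset of n examples (the function names are mine, for illustration):

```python
import numpy as np

def random_scan(n, rng):
    """Pick an example uniformly at random each step (sampling with replacement)."""
    while True:
        yield rng.integers(n)

def sequential_scan(n):
    """Visit examples in storage order on every pass: the best memory locality."""
    while True:
        yield from range(n)

def shuffle_once(n, rng):
    """Permute once, then scan that fixed order sequentially forever."""
    order = rng.permutation(n)
    while True:
        yield from order

def random_reshuffle(n, rng):
    """Draw a fresh permutation at the start of every pass through the data."""
    while True:
        yield from rng.permutation(n)

rng = np.random.default_rng(0)
it = random_reshuffle(5, rng)
print([next(it) for _ in range(10)])  # two passes, each a permutation of 0..4
```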

Questions? Upcoming things: Paper Review #8 is due today, and Paper Presentation #9 is on Wednesday.