Large-scale Deep Unsupervised Learning using Graphics Processors

Large-scale Deep Unsupervised Learning using Graphics Processors. Rajat Raina, Anand Madhavan, Andrew Y. Ng. Stanford University.

Learning from unlabeled data. [Diagram: classify car vs. motorcycle; unlabeled examples in the input space are used to learn a higher-level representation, e.g. with Deep Belief Networks or Sparse Coding.]

The promise of unsupervised learning Use large amounts of unlabeled data to learn complex/deep models, possibly with many parameters.

Some recent work on DBNs:

    Published source           Domain                  Number of free parameters
    Hinton et al.              Handwritten digits      1.6 million
    Hinton & Salakhutdinov     Face images             3 million
    Salakhutdinov & Hinton     Information retrieval   2.6 million
    Ranzato & Szummer          Text documents          3.6 million
    Our DBN model              Images                  100 million

(Similar situation for sparse coding.)

Large-scale learning [Banko & Brill, 2001]

Large-scale unsupervised learning. Current models: 1000s of input dimensions, 1000s of hidden units, ~10^6 parameters. Our desired model: 10^8 parameters.

Graphics Processors. [Diagram: motherboard holding the CPU and RAM, with the graphics card (GPU) attached.]

Why graphics processors? [Chart: peak Gflops (billion ops/sec), 2003 to 2008. NVIDIA GPUs climb toward 1000 Gflops while Intel CPUs remain far below.] (Source: NVIDIA CUDA Programming Guide)

Why graphics processors? The IBM ASCI White supercomputer cost $110 million and occupied the space of 2 basketball courts; its peak compute is matched by roughly 13 graphics cards.

GPU Schematic. [Diagram: 30 multiprocessors (MPs), each containing 8 scalar processors (SPs), registers, and 16 KB of shared memory accessed at ~1000 GB/s; a global memory of ~1 GB accessed at ~100 GB/s when reads are coalesced; and a slow transfer path from host RAM.] (Note: some additional features not displayed.)

GPU Programming. Two-level parallelism: split a task into blocks, and blocks into threads. Threads have access to global memory (not host RAM), with restrictions on memory access patterns. The main bottleneck is getting data into GPU memory, and accessing it in efficient ways. NVIDIA CUDA provides high-level routines to allocate/copy GPU memory, and good GPU matrix libraries that suffice for many machine learning tasks.
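To illustrate the access-pattern restrictions mentioned above, here is a minimal CUDA sketch (ours, not from the slides; the kernel names and the STRIDE constant are illustrative) contrasting coalesced and strided global-memory reads:

    #define STRIDE 16   /* illustrative stride */

    /* Coalesced: consecutive threads read consecutive floats, so the
       hardware merges them into a few wide memory transactions. */
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) out[i] = in[i];
    }

    /* Strided: consecutive threads touch addresses STRIDE elements apart,
       forcing many separate transactions and much lower effective bandwidth. */
    __global__ void copyStrided(const float* in, float* out, int n) {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if ((long long)i * STRIDE < n) out[i * STRIDE] = in[i * STRIDE];
    }

On hardware of this generation the strided kernel can run an order of magnitude slower than the coalesced one, which is why the matrix libraries mentioned above lay data out for coalesced access.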

Unsupervised learning on GPUs:

    Initialize parameters in global memory.
    while convergence criterion is not satisfied:
        Periodically transfer a large number of unlabeled examples into global memory.
        Pick a few of the unlabeled examples at a time, and compute the updates in
        parallel using the GPU's two-level parallelism (blocks and threads) or GPU
        matrix libraries.
    Transfer learnt parameters from global memory.
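A minimal host-side sketch of this loop in CUDA C (our illustration, not the authors' code; updateParams, converged, nextChunk, and the CHUNK/BATCH/DIM sizes are all hypothetical):

    #include <cuda_runtime.h>

    #define CHUNK 4096   /* examples transferred per host-to-GPU copy */
    #define BATCH 16     /* examples consumed per update step */
    #define DIM   1024   /* input dimension (illustrative) */

    int converged(void);                       /* hypothetical convergence test */
    const float* nextChunk(const float* all);  /* hypothetical: next CHUNK examples */

    /* Hypothetical kernel applying one stochastic update for a mini-batch. */
    __global__ void updateParams(float* params, const float* batch, int batchSize);

    void train(const float* h_examples, float* h_params, int nParams) {
        float *d_params, *d_examples;
        cudaMalloc((void**)&d_params, nParams * sizeof(float));
        cudaMalloc((void**)&d_examples, (size_t)CHUNK * DIM * sizeof(float));
        /* Initialize parameters in global memory. */
        cudaMemcpy(d_params, h_params, nParams * sizeof(float), cudaMemcpyHostToDevice);
        while (!converged()) {
            /* Periodically transfer a large chunk of unlabeled examples; host-to-GPU
               copies are the bottleneck, so each one is amortized over many updates. */
            cudaMemcpy(d_examples, nextChunk(h_examples),
                       (size_t)CHUNK * DIM * sizeof(float), cudaMemcpyHostToDevice);
            for (int b = 0; b + BATCH <= CHUNK; b += BATCH)   /* a few examples at a time */
                updateParams<<<32, 128>>>(d_params, d_examples + (size_t)b * DIM, BATCH);
        }
        /* Transfer learnt parameters back from global memory. */
        cudaMemcpy(h_params, d_params, nParams * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_params); cudaFree(d_examples);
    }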

Deep Belief Networks. (Learning Large DBNs using Graphics Processors. Rajat Raina, Andrew Y. Ng.)

Restricted Boltzmann Machine (RBM). Hidden units h_1, h_2, ... are connected to visible units v_1, v_2, v_3, ...:

    p(v,h) \propto e^{-E(v,h)}, \quad E(v,h) = -\sum_{i,j} v_i w_{ij} h_j - \sum_i c_i v_i - \sum_j b_j h_j

Contrastive divergence learning via the conditional distributions:

    p(h|v) = g(W^T v + b), \quad p(v|h) = g(W h + c)

where g is the elementwise logistic sigmoid.
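To make the contrastive divergence update concrete, here is a minimal single-example CD-1 sketch in plain C (our illustration under standard CD-1 assumptions, not the authors' GPU code; NV and NH are illustrative sizes, and on the GPU these loops become the matrix operations mentioned earlier):

    #include <math.h>
    #include <stdlib.h>

    #define NV 4   /* visible units (illustrative) */
    #define NH 3   /* hidden units (illustrative) */

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }
    static double rnd(void) { return (double)rand() / RAND_MAX; }

    /* One CD-1 step for a single example v with learning rate eps:
     * compute p(h|v), sample h, reconstruct v' = p(v|h), compute p(h|v'),
     * then update W, b, c with the positive-minus-negative statistics. */
    void cd1_update(double W[NV][NH], double b[NH], double c[NV],
                    const double v[NV], double eps) {
        double ph[NH], h[NH], v1[NV], ph1[NH];
        for (int j = 0; j < NH; j++) {            /* p(h|v) = g(W^T v + b) */
            double s = b[j];
            for (int i = 0; i < NV; i++) s += W[i][j] * v[i];
            ph[j] = sigmoid(s);
            h[j] = rnd() < ph[j] ? 1.0 : 0.0;     /* sample hidden state */
        }
        for (int i = 0; i < NV; i++) {            /* reconstruction p(v|h) = g(W h + c) */
            double s = c[i];
            for (int j = 0; j < NH; j++) s += W[i][j] * h[j];
            v1[i] = sigmoid(s);                   /* mean-field reconstruction */
        }
        for (int j = 0; j < NH; j++) {            /* p(h|v') for the negative phase */
            double s = b[j];
            for (int i = 0; i < NV; i++) s += W[i][j] * v1[i];
            ph1[j] = sigmoid(s);
        }
        for (int i = 0; i < NV; i++)              /* positive minus negative statistics */
            for (int j = 0; j < NH; j++)
                W[i][j] += eps * (v[i] * ph[j] - v1[i] * ph1[j]);
        for (int j = 0; j < NH; j++) b[j] += eps * (ph[j] - ph1[j]);
        for (int i = 0; i < NV; i++) c[i] += eps * (v[i] - v1[i]);
    }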

Experimental setup. Single graphics card: NVIDIA GTX 280 with 1 GB on-board memory and 240 cores; current price US $250. CPU: two cores, each at 3.16 GHz.

Learning Large RBMs. [Chart: learning time for 10 million examples (log scale) vs. millions of parameters (1 to 45), dual-core CPU vs. GPU. CPU times range from about 35 hours up to more than 2 weeks; GPU times range from about half an hour to about 5 hours, up to 72x faster.]

Overlapping patches DBN. [Diagram: hidden units A with parameters W_A, b_A, c_A over patch A of the input image; hidden units B with parameters W_B, b_B, c_B over the overlapping patch B.]

Overlapping patches DBN example (bottom to top):

    Input image: 20736 units (144x144)
    Layer 1: 32768 units (128 units per 24x24 patch)
    Layer 2: 15680 units
    Layer 3: 8192 units
    Layer 4: 2048 units

110 million parameters in total. All layers can be learnt in about 1 day on a GPU.

Sparse Coding

Sparse coding. Represent an input as a sparse linear combination of basis vectors b with activations a:

    Input x = 0.8 * b_{87} + 0.3 * b_{376} + 0.5 * b_{411}

Given unlabeled data x^{(i)}, obtain b by solving:

    \min_{b,a} \sum_i \| x^{(i)} - \sum_j a_j^{(i)} b_j \|_2^2 + \sum_i \| a^{(i)} \|_1 \quad \text{s.t.} \quad \forall j: \| b_j \| \le 1

Alternating minimization: keep a fixed and find the optimal b; then keep b fixed and find the optimal a.

Parallel Sparse Coding.

    \min_{b,a} \sum_i \| x^{(i)} - \sum_j a_j^{(i)} b_j \|_2^2 + \sum_i \| a^{(i)} \|_1 \quad \text{s.t.} \quad \forall j: \| b_j \| \le 1

Alternating minimization: keeping a fixed and finding the optimal b is easy on the GPU (projected gradient descent). Keeping b fixed and finding the optimal a is not as straightforward; we need to parallelize:

    \min_a \| x - \sum_j a_j b_j \|_2^2 + \| a \|_1

Parallel Sparse Coding. The objective is easy to optimize for one coordinate of a, keeping the others fixed (Friedman et al., 2007). One iteration of our algorithm: optimize each coordinate separately to obtain a* (components a*_1, a*_2, ...), and use the move from a toward a* as a descent direction to obtain a_new. [Diagram: descent direction from a toward (a*_1, a*_2), yielding a_new.] Both alternating steps are sketched below.
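Here is a minimal serial CPU sketch of both alternating steps (our illustration with illustrative sizes DIM and NB, not the authors' parallel GPU code; the coordinate step is the standard soft-thresholding update in the style of Friedman et al. (2007)):

    #include <math.h>

    #define DIM 8   /* input dimension (illustrative) */
    #define NB 16   /* number of basis vectors (illustrative) */

    /* Soft-thresholding (shrinkage) operator. */
    static double soft(double z, double t) {
        if (z > t)  return z - t;
        if (z < -t) return z + t;
        return 0.0;
    }

    /* b-step: one projected gradient step on the basis, keeping a fixed.
     * grad holds the gradient of the quadratic term w.r.t. each b_j. */
    void basis_step(double b[NB][DIM], const double grad[NB][DIM], double step) {
        for (int j = 0; j < NB; j++) {
            double norm2 = 0.0;
            for (int d = 0; d < DIM; d++) {
                b[j][d] -= step * grad[j][d];        /* gradient step */
                norm2 += b[j][d] * b[j][d];
            }
            if (norm2 > 1.0) {                       /* project onto ||b_j|| <= 1 */
                double s = 1.0 / sqrt(norm2);
                for (int d = 0; d < DIM; d++) b[j][d] *= s;
            }
        }
    }

    /* a-step: one pass of coordinate minimization for
     *   min_a ||x - sum_j a_j b_j||_2^2 + ||a||_1.
     * r must hold the residual x - sum_j a_j b_j on entry; it is kept
     * consistent as each coordinate changes. */
    void coord_pass(const double b[NB][DIM], double a[NB], double r[DIM]) {
        for (int j = 0; j < NB; j++) {
            double bb = 0.0, br = 0.0;
            for (int d = 0; d < DIM; d++) {
                bb += b[j][d] * b[j][d];
                br += b[j][d] * (r[d] + a[j] * b[j][d]);  /* exclude a_j's own term */
            }
            /* Closed-form coordinate minimizer; threshold 0.5 corresponds to the
               unit weight on ||a||_1 in the objective above. */
            double anew = soft(br, 0.5) / bb;
            double delta = anew - a[j];
            for (int d = 0; d < DIM; d++) r[d] -= delta * b[j][d];  /* update residual */
            a[j] = anew;
        }
    }

In the parallel version described on this slide, many coordinates are optimized independently to obtain a*, and the move from a toward a* is used as a descent direction, rather than applying the serial updates above one coordinate at a time.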

Sparse coding with 10^6 parameters. [Chart: learning time in days with 10 million examples, dual-core CPU vs. GPU, at 3% and 10% nonzero sparsity. CPU: up to 19 days; GPU: about 6 hours to 1 day; roughly 15x faster.]

Summary. Large-scale unsupervised learning: ten times more data might transform an OK algorithm into a good algorithm, and working at smaller scale risks confounding the effects of the model itself with the effects of scale. GPUs are a powerful tool for machine learning: easy to program (no low-level programming needed), and especially useful for stochastic learning methods. Learning algorithms for DBNs and sparse coding can be an order of magnitude faster.

THE END

Why graphics processors? [Chart: bandwidth from memory to processor (GB/s), 2003 to 2007. NVIDIA GPUs rise above 100 GB/s while Intel CPUs remain far lower.] (Source: NVIDIA CUDA Programming Guide)

GPU Programming: A = A + B

    /* Sizes implied by the <<<32, 128>>> launch below; the original slide
       left SIZE and SIZE_BYTES undefined. */
    #define SIZE (32 * 128)
    #define SIZE_BYTES (SIZE * sizeof(float))

    /* GPU kernel: each thread adds one element. */
    __global__ void vecAdd(float* A, float* B) {
        int my = threadIdx.x + blockIdx.x * 128;
        A[my] = A[my] + B[my];
    }

    /* CPU code */
    int main(int argc, char** argv) {
        float A[SIZE], B[SIZE];
        float *d_A, *d_B;
        cudaMalloc((void**)&d_A, SIZE_BYTES);
        cudaMalloc((void**)&d_B, SIZE_BYTES);
        cudaMemcpy(d_A, A, SIZE_BYTES, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, B, SIZE_BYTES, cudaMemcpyHostToDevice);
        vecAdd<<<32, 128>>>(d_A, d_B);   /* 32 blocks of 128 threads */
        cudaThreadSynchronize();
        cudaMemcpy(A, d_A, SIZE_BYTES, cudaMemcpyDeviceToHost);
    }

(Adapted from http://www.cs.technion.ac.il/~marks/docs/linuxclubgpgpu.pdf)