OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi
OPTIMIZED GPU KERNELS FOR DEEP LEARNING
Amir Khosrowshahi
GTC, 17 Mar 2015

Outline
- About nervana
- Optimizing deep learning at assembler level
- Limited precision for deep learning
- neon benchmarks

About nervana
A platform for machine intelligence
- enable deep learning at scale
- optimized from algorithms to silicon
Section navigation: About, Kernels, neon, Summary

Verticals
Medical, Finance, Pharma, Oil & Gas, Agriculture
- Deep learning is supplanting traditional approaches everywhere
- Small improvements have large impact
- Customers require a clear roadmap that scales to growing need

nervana platform for deep learning
- train, explore, deploy
- [Diagram labels: Data, nervana framework, nervana cloud, Solutions; hardware: GPUs, CPUs, nervana engine]

maxas: a Maxwell Assembler
Scott Gray
Full control of:
- register allocation
- instruction ordering
- control codes
- barriers, stall counts
Built-in scheduler (optional)
Meta-programming
See GitHub repo for docs and examples

ptxas struggles with Instruction Level Parallelism
[Figure: distribution of the number of instructions between LDS and dependent FFMA operands. x-axis: FFMA Line# - LDS Line#; y-axis: count; series: ptx, cublas; annotations: "Bad", "Good". Courtesy Scott Gray]

Easy register allocation through maxas
Register banking for outer products: c = a b^T
[Figure labels: c, a, b]

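The outer-product form c = a b^T on this slide is the inner step of a GEMM: the full product is a sum of rank-1 updates, one per step along the shared dimension. The sketch below illustrates only that arithmetic in plain Python. It does not model Maxwell register banks or maxas syntax, and the function name `gemm_outer_products` is hypothetical.

```python
def gemm_outer_products(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer product) updates.

    A is m x k and B is k x n, both given as lists of rows.
    """
    m, k_dim, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for k in range(k_dim):
        a = [A[i][k] for i in range(m)]  # column k of A
        b = B[k]                         # row k of B
        for i in range(m):
            for j in range(n):
                C[i][j] += a[i] * b[j]   # one multiply-add per output element
    return C


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = gemm_outer_products(A, B)
```

In a GPU kernel each thread would typically hold a small tile of C in registers and apply these updates with fused multiply-adds; how the operands are placed in registers is what the slide's "register banking" refers to.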
Example GEMM code in maxas
[Code listing shown as an image; annotated parts:]
- Dual issue instr.
- Load from shared
- Set barrier
- Barrier sync
- Control codes
- Fused fp32 multiply add

Convolution kernels for deep learning
Input * Filters = Output
- C: number of input channels
- H x W: input spatial dims
- R x S: filter spatial dims
- K: number of filters
- P x Q: output spatial dims
- N: mini-batch dim (not shown)

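To make the dimension names concrete, here is a minimal direct-convolution sketch in plain Python. It assumes unit stride and no padding, so P = H - R + 1 and Q = W - S + 1. The slide does not state stride or padding, and the function name `conv_fprop` is hypothetical. It is a reference loop nest, not the optimized kernel described in the talk.

```python
def conv_fprop(inputs, filters):
    """Direct convolution (cross-correlation), stride 1, no padding.

    inputs:  nested lists indexed [N][C][H][W]
    filters: nested lists indexed [K][C][R][S]
    returns: nested lists indexed [N][K][P][Q]
    """
    N, C = len(inputs), len(inputs[0])
    H, W = len(inputs[0][0]), len(inputs[0][0][0])
    K = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    P, Q = H - R + 1, W - S + 1
    out = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += inputs[n][c][p + r][q + s] * filters[k][c][r][s]
                    out[n][k][p][q] = acc
    return out


# Small shapes chosen for illustration: N=3, C=3, H=W=3, K=2, R=S=2.
ones_in = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(3)]
ones_f = [[[[1.0] * 2 for _ in range(2)] for _ in range(3)] for _ in range(2)]
out = conv_fprop(ones_in, ones_f)
```

With all-ones inputs and filters, every output element equals C * R * S, which is a quick sanity check on the loop bounds.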
Access patterns for matrix lowering
Convolution kernels: fprop, bprop, update
- bprop (backprop step 1). [Figure labels: δ1; P = Q = 2, K = 2; C = 3, K = 2, R = S = 2; δ0; N = 3, C = 3, H = W = 3]
- update (backprop step 2, weight updates). [Figure labels: δ1; output of the previous layer; P = Q = 2, K = 2; weight updates; C = 3, K = 2, R = S = 2; N = 3, C = 3, H = W = 3]

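Matrix lowering can be sketched as follows: each filter-sized patch of the input becomes one column of a matrix, so the forward pass reduces to a single matrix multiply. This is a generic im2col-style illustration assuming stride 1 and no padding. The helper names are hypothetical, and the talk does not say whether its kernels materialize the lowered matrix or only follow its access pattern.

```python
def im2col(image, R, S):
    """Lower one image [C][H][W] to a matrix with C*R*S rows and P*Q columns."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    P, Q = H - R + 1, W - S + 1
    rows = []
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # Row (c, r, s) holds that filter tap's input value for every output (p, q).
                rows.append([image[c][p + r][q + s] for p in range(P) for q in range(Q)])
    return rows


def matmul(A, B):
    """Plain matrix multiply on lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def conv_fprop_lowered(image, filters):
    """Forward convolution of one image as (K x C*R*S) times (C*R*S x P*Q)."""
    R, S = len(filters[0][0]), len(filters[0][0][0])
    # Flatten each filter in (c, r, s) order to match the row order of im2col.
    flat_filters = [[w for plane in f for row in plane for w in row] for f in filters]
    return matmul(flat_filters, im2col(image, R, S))


image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]  # C=1, H=W=3
filters = [[[[1.0, 1.0], [1.0, 1.0]]]]                          # K=1, C=1, R=S=2
lowered_out = conv_fprop_lowered(image, filters)
```

The result has K rows and P*Q columns; here each entry is the sum of one 2 x 2 patch of the image. The bprop and update passes can be written as matrix multiplies over related lowered operands in the same spirit.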
Deep learning with low precision works

Improving the speed of neural networks on CPUs
Vincent Vanhoucke (Google, Inc., Mountain View, CA), Andrew Senior (Google, Inc., New York, NY), Mark Z. Mao (Google, Inc., Mountain View, CA)
Abstract: Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3× improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10× speedup over an unoptimized baseline and a 4× speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

Low precision arithmetic for deep learning
Matthieu Courbariaux & Jean-Pierre David (Department of Electrical Engineering, École Polytechnique de Montréal, Montréal, QC H3T 1J4, Canada), Yoshua Bengio (Department of Computer Science and Operations Research, Université de Montréal, Montréal, QC H3T 1J4, Canada)
Abstract: We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

Deep Learning with Limited Numerical Precision
Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan (IBM T. J. Watson Research Center, Yorktown Heights, NY), Pritish Narayanan (IBM Almaden Research Center, San Jose, CA)
Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.

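Stochastic rounding, which the last abstract singles out, rounds a value up or down at random with probability proportional to its distance from each neighbor, so the rounding error is zero on average. The sketch below is a minimal fixed-point version in plain Python. The function name and the choice of two fractional bits are illustrative assumptions, and saturation to a fixed word length is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng):
    """Round x to a grid with step 2**-frac_bits.

    Rounds up with probability equal to the fractional remainder,
    so the expected value of the result equals x.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    prob_up = scaled - lower
    rounded = lower + (1 if rng.random() < prob_up else 0)
    return rounded / scale


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [stochastic_round(0.3, 2, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)

# Round-to-nearest on the same grid always returns the same neighbor.
nearest = round(0.3 * 4) / 4
```

Every sample is one of the two grid neighbors of 0.3 (0.25 or 0.5), and their mean stays close to 0.3, whereas round-to-nearest always yields 0.25. That unbiasedness is what lets small gradient updates survive in a low-precision format.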
neon: nervana python deep learning library
- User-friendly, extensible, abstracts parallelism
- Support for many deep learning models
- Interface to nervana cloud
- Supports multiple backends: nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon)
- Multiple limited precision options
- Optimized for Maxwell at assembler level

neon: easy model configuration
[Configuration listing shown as an image; annotated parts:]
- Dataset
- Weight initialization
- Learning rule
- Model layers and cost

neon experiments in fp16/32
- Use 16-bit floating point (fp16) as memory format
- Multiply-and-adds use fp32
- Kernel support for: GEMM; stochastic rounding; dropout / maxout; conv {f,b}prop, update; max pooling; statistics collection
- Python element-wise operations auto-compiled into kernels
- fp16 accumulations done carefully to minimize errors
- Working with collaborators (Baidu, Bengio lab) to improve

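The reason for keeping multiply-adds in fp32 while storing data in fp16 can be seen with a small emulation. The sketch below uses Python's `struct` module, whose `'e'` and `'f'` formats round a value to IEEE 754 half and single precision. The helper names are hypothetical, and this is an approximation of the arithmetic only, not the kernels from the talk.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product where products and the running sum are both rounded to fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_fp32_accumulate(a, b):
    """Dot product with an fp32 running sum, rounded to fp16 once when stored."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + x * y)  # approximates a fused fp32 multiply-add
    return to_fp16(acc)


a = [to_fp16(1.0)] * 4096
b = [to_fp16(1.0)] * 4096
naive = dot_fp16_accumulate(a, b)
mixed = dot_fp32_accumulate(a, b)
```

With 4096 products of 1.0, the fp16 running sum stalls at 2048 because 2049 is not representable in half precision and rounds back down, while the fp32 accumulator reaches 4096 and is stored exactly. This is one illustration of why accumulations involving fp16 need care.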
fp16/32 accuracy
No accuracy loss going from fp32 to fp16
[Figure: error (%) distribution over 25 runs; y-axis: count; series: fp32, fp16, fp16 sto]

Speed benchmarks [1]: fp16 vs others
[Figure: time per layer (ms) for 5 convolutional layers, forward pass and backward pass; series: neon fp16, neon, Cuda-convnet2, Torch7, cudnn*, cudanet]
Lower times are better. Benchmarks on GTX980.
* 2nd, 3rd layer don't fit on a 4GB card
[1] Soumith Chintala, github.com/soumith/convnet-benchmarks

Speed benchmarks [1]: fp16 vs fp
[Figure: time per layer (ms) for 5 convolutional layers, forward pass and backward pass; series: neon fp16, neon, Cuda-convnet2, cudanet, Torch7, cudnn*]
Lower times are better. Benchmarks on GTX980.
* some layers do not fit on a 4GB card
[1] Soumith Chintala, github.com/soumith/convnet-benchmarks

Benchmarks [1] show 2x performance
Raw numbers (averaged over 10 runs) for Alexnet, Overfeat, and VGG (N=64)
[Table per network: Avg(10) fprop, Avg(10) bprop, Avg(10) total, each in msecs and gflops]
Maximum practical peak is 4700 gflops.
More than double speed [2] with half memory storage / bandwidth.
[Figure: Time (s) and Speed (TFLOPS) for Alexnet Cuda-Convnet and Alexnet fp16]
[1] Using conventions here: Soumith Chintala, github.com/soumith/convnet-benchmarks
[2] Numbers are relative to Titan Black (Kepler architecture)

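A gflops figure like those on these slides can be related to a measured time by counting the arithmetic in a convolution. The sketch below uses the common convention of two floating-point operations per multiply-add. Whether the talk's numbers were computed exactly this way is an assumption, and the layer shape and the 2.0 ms timing are hypothetical values chosen only to show the calculation.

```python
def conv_flops(N, C, K, R, S, P, Q):
    """Floating-point operations for one convolution pass (multiply-add = 2 ops)."""
    return 2 * N * C * K * R * S * P * Q


def gflops(flops, msecs):
    """Convert an operation count and a time in milliseconds to gflops."""
    return flops / (msecs * 1e-3) / 1e9


PEAK_GFLOPS = 4700.0  # "maximum practical peak" quoted on the slide

# Hypothetical layer and timing, for illustration only (not from the talk).
flops = conv_flops(N=64, C=64, K=64, R=3, S=3, P=32, Q=32)
rate = gflops(flops, msecs=2.0)
utilization = rate / PEAK_GFLOPS
print(f"{rate:.1f} gflops, {utilization:.0%} of the quoted peak")
```

Comparing the achieved rate to the quoted peak gives a utilization fraction, which is the sense in which the summary speaks of full utilization of the GPU.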
Summary
- neon: User-friendly python library
- maxas: Powerful tool for optimizing deep learning
- Fast performance, full utilization of GPU
- Limited precision allows for larger models
- Toolbox for exploring numerical representations

GTC 2015
Contact us at
We are hiring!
- Cloud engineers
- GPU experts
- Machine learning engineers
- Software engineers
Sign up to try neon, our deep learning library. We can help solve your problem.
High Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationDEEP NEURAL NETWORKS AND GPUS. Julie Bernauer
DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int
More informationDeep Learning on Modern Architectures. Keren Zhou 4/17/2017
Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others
More informationA performance comparison of Deep Learning frameworks on KNL
A performance comparison of Deep Learning frameworks on KNL R. Zanella, G. Fiameni, M. Rorro Middleware, Data Management - SCAI - CINECA IXPUG Bologna, March 5, 2018 Table of Contents 1. Problem description
More informationProfiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang
Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework
More informationHigh-Performance Data Loading and Augmentation for Deep Neural Network Training
High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose
More informationOptimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs
Normalized execution time Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs Chao Li # Yi Yang* Min Feng* Srimat Chakradhar* Huiyang Zhou # # Department of Electrical and Computer
More informationMIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius
MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationDEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA
DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA TOPICS COVERED Convolutional Networks Deep Learning Use Cases GPUs cudnn 2 MACHINE LEARNING! Training! Train the model from supervised
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning
More informationPersistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL
(stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationarxiv: v5 [cs.lg] 23 Sep 2015
TRAINING DEEP NEURAL NETWORKS WITH LOW PRECISION MULTIPLICATIONS Matthieu Courbariaux & Jean-Pierre David École Polytechnique de Montréal {matthieu.courbariaux,jean-pierre.david}@polymtl.ca arxiv:1412.7024v5
More informationOPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS
April 4-7, 2016 Silicon Valley OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS Jeremy Appleyard, 7 April 2016 RECURRENT NEURAL NETWORKS Output is fed into input Perform the same operation repeatedly
More informationGPU Coder: Automatic CUDA and TensorRT code generation from MATLAB
GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1 GPUs and CUDA programming faster Performance CUDA OpenCL C/C++ GPU Coder MATLAB Python Ease of programming
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationDistributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability
Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,
More informationInference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA
Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos
More informationEFFICIENT INFERENCE WITH TENSORRT. Han Vanholder
EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationCOMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017
COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization
More informationDeep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers
Deep Learning Workshop Nov. 20, 2015 Andrew Fishberg, Rowan Zellers Why deep learning? The ImageNet Challenge Goal: image classification with 1000 categories Top 5 error rate of 15%. Krizhevsky, Alex,
More informationImplementing Deep Learning for Video Analytics on Tegra X1.
Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationFast Algorithms for Convolutional Neural Networks
Fast Algorithms for Convolutional Neural Networks Andrew Lavin alavin@acm.org Scott Gray Nervana Systems sgray@nervanasys.com Abstract Deep convolutional neural networks take GPU-days of computation to
More informationDEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla
DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationHalf Precision Benchmarking for HPC
PiotrLuszczek Half Precision Benchmarking for HPC S7676 May 11, 2017 GPU Technology Conference, San Jose, CA, USA 1 / 18 May 11, 2017 GPU Technology Conference, San Jose, CA, USA 2 / 18 Major Floating
More informationIMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC.
IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC. CONTENTS Deep Learning Review Implementation on GPU using cudnn Optimization Issues Introduction to VUNO-Net DEEP LEARNING REVIEW BRIEF HISTORY OF NEURAL
More informationGPU-Accelerated Deep Learning
GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationAccelerating cublas/cudnn using Input-Aware Auto-Tuning
Accelerating cublas/cudnn using Input-Aware Auto-Tuning The ISAAC library Philippe Tillet Harvard University Introduction cublas does not always achieve peak performance: (M, N, K) 1 = (4096, 4096, 4096):
More informationImplementation of Deep Convolutional Neural Net on a Digital Signal Processor
Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm
More informationScaling Deep Learning. Bryan
Scaling Deep Learning @ctnzr What do we want AI to do? Guide us to content Keep us organized Help us find things Help us communicate 帮助我们沟通 Drive us to work Serve drinks? Image Q&A Baidu IDL Sample questions
More informationKeras: Handwritten Digit Recognition using MNIST Dataset
Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30
More informationEffectively Scaling Deep Learning Frameworks
Effectively Scaling Deep Learning Frameworks (To 40 GPUs and Beyond) Welcome everyone! I m excited to be here today and get the opportunity to present some of the work that we ve been doing at SVAIL, the
More informationNVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS
TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationDeep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.
Deep Learning 861.061 Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD asan.agibetov@meduniwien.ac.at Medical University of Vienna Center for Medical Statistics,
More informationDeep Learning with Torch
Deep Learning with Torch The good, the bad, the ugly since 2002 Jimmy Ba jimmy@psi.utoronto.ca What is Torch? Year 2012 Google Answer: Torch7 provides a Matlab-like environment for state-of-the-art machine
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationCNN optimization. Rassadin A
CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage
More informationTETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory
TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationNVIDIA Update and Directions on GPU Acceleration for Earth System Models
NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationDeep Learning and Its Applications
Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent
More informationProfiling GPU Code. Jeremy Appleyard, February 2016
Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation
More informationImageNet Classification with Deep Convolutional Neural Networks
ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
MIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA
MIXED PRECISION TRAINING OF NEURAL NETWORKS Carl Case, Senior Architect, NVIDIA OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3.
Lecture 1: Introduction and Computational Thinking
PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational
Deep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
Mocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT
Mocha.jl Deep Learning in Julia Chiyuan Zhang (@pluskid) CSAIL, MIT Deep Learning Learning with multi-layer (3~30) neural networks, on a huge training set. State-of-the-art on many AI tasks Computer Vision:
GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13
GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning
Training Deep Neural Networks (in parallel)
Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors as
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability Keren Zhou, Guangming Tan, Xiuxia Zhang, Chaowei Wang, Ninghui Sun Institute of Computing Technology, Chinese Academy of
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design Minsoo Rhu Natalia Gimelshein Jason Clemons Arslan Zulfiqar
Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism
Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM
Hardware and Software. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 6-1
Lecture 6: Hardware and Software Lecture 6-1 Administrative Assignment 1 was due yesterday. Assignment 2 is out, due Wed May 1. Project proposal due Wed April 24. Project-only office hours leading up to
DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM
DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM AGENDA 1 Introduction to Deep Learning 2 What is DIGITS 3 How to use DIGITS Practical DEEP LEARNING Examples Image Classification, Object Detection,
SPARSE PERSISTENT RNN. Feiwen Zhu, 5/9/2017
SPARSE PERSISTENT RNN Feiwen Zhu, 5/9/2017 Motivation Introduction Algorithm AGENDA Naïve Implementation Optimizations Experiments Conclusion 2 MOTIVATION Exploit sparsity for faster, larger networks Recurrent
Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper
Deep Convolutional Neural Networks Nov. 20th, 2015 Bruce Draper Background: Fully-connected single layer neural networks Feed-forward classification Trained through back-propagation Example Computer Vision
GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
Enabling the future of Artificial intelligence
Enabling the future of Artificial intelligence Contents AI Overview Intel Nervana AI products Hardware Software Intel Nervana Deep Learning Platform Learn more - Intel Nervana AI Academy Artificial Intelligence,
Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL
Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL 1 THIS TALK Using mixed precision and Volta your networks can be: 1. 2-4x faster 2. half the
Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
High Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015
ACCELERATED COMPUTING: THE PATH FORWARD Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 COMMODITY DISRUPTS CUSTOM SOURCE: Top500 ACCELERATED COMPUTING: THE PATH FORWARD It's time to start
Deep Learning for Computer Vision II
IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L
SVM multiclass classification in 10 steps 17/32
SVM multiclass classification in 10 steps import numpy as np # load digits dataset from sklearn import datasets digits = datasets. load_digits () # define training set size n_samples = len ( digits. images
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
CME 213 SPRING Eric Darve
CME 213 SPRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
Deep learning in MATLAB From Concept to CUDA Code
Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
Convolutional Neural Networks
NPFL114, Lecture 4 Convolutional Neural Networks Milan Straka March 25, 2019 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise
Deep Learning for Computer Vision with MATLAB By Jon Cherrie
Deep Learning for Computer Vision with MATLAB By Jon Cherrie 2015 The MathWorks, Inc. 1 Deep learning is getting a lot of attention "Dahl and his colleagues won $22,000 with a deep-learning system. 'We
Layer-wise Performance Bottleneck Analysis of Deep Neural Networks
Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William
Machine Learning. MGS Lecture 3: Deep Learning
Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ Machine Learning MGS Lecture 3: Deep Learning WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer
Machine Learning on VMware vSphere with NVIDIA GPUs
Machine Learning on VMware vSphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology
Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK
Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK 17 May 2016, Melbourne 24 May 2016, Sydney Werner Scholz, CTO and Head of R&D, XENON Systems Mike Wang, Solutions Architect,
Deep Learning: Transforming Engineering and Science The MathWorks, Inc.
Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA
Deploying Deep Learning Networks to Embedded GPUs and CPUs
Deploying Deep Learning Networks to Embedded GPUs and CPUs Rishu Gupta, PhD Senior Application Engineer, Computer Vision 2015 The MathWorks, Inc. 1 MATLAB Deep Learning Framework Access Data Design + Train
Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:
Code Mania 2019 Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python: 1. Introduction to Artificial Intelligence 2. Introduction to python programming and Environment
Deep Neural Networks:
Deep Neural Networks: Part II Convolutional Neural Network (CNN) Yuan-Kai Wang, 2016 Web site of this course: http://pattern-recognition.weebly.com source: CNN for ImageClassification, by S. Lazebnik,
Parallel Deep Network Training
Lecture 26: Parallel Deep Network Training Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Speech Debelle Finish This Album (Speech Therapy) Eat your veggies and study
Deep Learning Frameworks with Spark and GPUs
Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
GPUS FOR NGVLA. M Clark, April 2015
GPUS FOR NGVLA M Clark, April 2015 GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS MACHINES PC DATA CENTER MOBILE The World Leader in Visual Computing 2 What is a GPU? Tesla K40
DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
Scaling Deep Learning on Multiple In-Memory Processors
Scaling Deep Learning on Multiple In-Memory Processors Lifan Xu, Dong Ping Zhang, and Nuwan Jayasena AMD Research, Advanced Micro Devices, Inc. {lifan.xu, dongping.zhang, nuwan.jayasena}@amd.com ABSTRACT
Mathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs Introduction Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University How to.. Process terabytes
XPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford University
Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford University Abstract: This project aims at creating a benchmark for Deep Learning (DL) algorithms