OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

1 OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015

2 Outline: About nervana; Optimizing deep learning at assembler level; Limited precision for deep learning; neon benchmarks.

3 About nervana: a platform for machine intelligence. Enable deep learning at scale; optimized from algorithms to silicon. Section bar: About, Kernels, neon, Summary.

5 Verticals: Medical, Finance, Pharma, Oil & Gas, Agriculture. Deep learning supplanting traditional approaches everywhere. Small improvements have large impact. Customers require clear roadmap that scales to growing need.

7 nervana platform for deep learning: train, explore, deploy. [Diagram labels: nervana framework, Data, nervana cloud, Solutions, GPUs, CPUs, nervana engine]

10 maxas: a Maxwell Assembler. Author: Scott Gray. Full control of: register allocation; instruction ordering; control codes (barriers, stall counts). Built-in scheduler (optional). Meta-programming. See GitHub repo for docs and examples.

11 ptxas struggles with Instruction Level Parallelism. [Figure: distribution of number of instructions between LDS and dependent FFMA operands. x-axis: FFMA line # - LDS line #; y-axis: count; series: ptx, cublas; annotations: Bad, Good. Courtesy Scott Gray.]

12 Easy register allocation through maxas. Register banking for outer products: c = a b^T. [Diagram labels: a, b, c]
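
The outer-product form c = a b^T on this slide can be illustrated with a small sketch. It shows only the arithmetic: a matrix product built up as a sum of outer products, one per step along the shared dimension. The function name `outer_product_gemm` is hypothetical, and register banking itself is a hardware detail that this Python sketch does not model.

```python
def outer_product_gemm(A, B):
    """Compute C = A * B as a sum of outer products (illustrative only)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):
        a = [A[i][k] for i in range(M)]  # column k of A (the "a" vector)
        b = B[k]                         # row k of B (the "b" vector)
        for i in range(M):
            for j in range(N):
                # One multiply-add per output element: c += a b^T
                C[i][j] += a[i] * b[j]
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = outer_product_gemm(A, B)
```

Each step reads M + N input values and performs M * N multiply-adds, which is why this formulation keeps the arithmetic units busy relative to loads.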

19 Example GEMM code in maxas. [Code listing not captured; annotations: dual-issue instruction, load from shared, set barrier, barrier sync, control codes, fused fp32 multiply-add]

20 Convolution kernels for deep learning. Input * Filters = Output. Dimensions: C = number of input channels; H x W = input spatial dims; R x S = filter spatial dims; K = number of filters; P x Q = output spatial dims; N = mini-batch dim (not shown).
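
The relationship between these dimensions can be sketched as follows. The slide does not mention padding or stride, so the `pad` and `stride` parameters, the function names, and the (N, C, H, W) ordering of the shape tuples are assumptions made for illustration only.

```python
def conv_output_dims(H, W, R, S, pad=0, stride=1):
    """Output spatial dims P x Q for an H x W input and an R x S filter."""
    P = (H + 2 * pad - R) // stride + 1
    Q = (W + 2 * pad - S) // stride + 1
    return P, Q


def conv_shapes(N, C, H, W, K, R, S, pad=0, stride=1):
    """Tensor shapes using the slide's symbols (axis order is an assumption)."""
    P, Q = conv_output_dims(H, W, R, S, pad, stride)
    return {
        "input": (N, C, H, W),
        "filters": (K, C, R, S),
        "output": (N, K, P, Q),
    }


shapes = conv_shapes(N=3, C=3, H=3, W=3, K=2, R=2, S=2)
```

With H = W = 3 and R = S = 2, no padding and unit stride, this gives P = Q = 2, the sizes used in the matrix-lowering figures.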

22 Access patterns for matrix lowering. Convolution kernels: fprop.

23 Access patterns for matrix lowering. Convolution kernels: bprop. [Figure: Backprop Step 1. Labels and dimensions in order: δ1 (P = Q = 2, K = 2); C = 3, K = 2, R = S = 2; δ0 (N = 3, C = 3, H = W = 3)]

24 Access patterns for matrix lowering. Convolution kernels: update. [Figure: Backprop Step 2 (Weight Updates). Labels: δ1, output of the previous layer, weight updates. Dimensions in order: P = Q = 2, K = 2; C = 3, K = 2, R = S = 2; N = 3, C = 3, H = W = 3]
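
Matrix lowering for the forward pass can be sketched with the textbook im2col approach: each filter-sized input patch becomes a column, so the convolution reduces to one matrix multiply. This is a generic illustration for a single image with unit stride and no padding, not necessarily the access pattern used by the kernels in this talk. The helper names are hypothetical.

```python
def im2col(image, R, S):
    """Lower a C x H x W input to a (C*R*S) x (P*Q) matrix."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    P, Q = H - R + 1, W - S + 1
    rows = []
    for c in range(C):
        for r in range(R):
            for s in range(S):
                # One row per (channel, filter row, filter column) triple.
                rows.append([image[c][p + r][q + s]
                             for p in range(P) for q in range(Q)])
    return rows, P, Q


def matmul(a, b):
    """Plain matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv_fprop(image, filters):
    """Forward convolution as K x (C*R*S) times (C*R*S) x (P*Q)."""
    R, S = len(filters[0][0]), len(filters[0][0][0])
    lowered, P, Q = im2col(image, R, S)
    # Flatten each C x R x S filter in the same (c, r, s) order as im2col.
    weights = [[w for chan in f for row in chan for w in row] for f in filters]
    flat = matmul(weights, lowered)
    # Reshape each output row of length P*Q back to P x Q.
    return [[row[p * Q:(p + 1) * Q] for p in range(P)] for row in flat]


# Sizes from the figures: C = 3, H = W = 3, K = 2, R = S = 2.
image = [[[c * 9 + h * 3 + w for w in range(3)] for h in range(3)]
         for c in range(3)]
ones = [[[1, 1], [1, 1]] for _ in range(3)]
pick = [[[0, 0], [0, 1]], [[0, 0], [0, 0]], [[0, 0], [0, 0]]]
out = conv_fprop(image, [ones, pick])
```

The lowered matrix duplicates overlapping input pixels, which is the memory cost that makes the access pattern worth optimizing.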

26 Deep learning with low precision works. Paper: "Improving the speed of neural networks on CPUs". Vincent Vanhoucke (Google, Inc., Mountain View, CA), Andrew Senior (Google, Inc., New York, NY), Mark Z. Mao (Google, Inc., Mountain View, CA). Abstract: Recent advances in deep learning have made the use of large, deep neural networks with tens of millions of parameters suitable for a number of applications that require real-time processing. The sheer size of these networks can represent a challenging computational burden, even for modern CPUs. For this reason, GPUs are routinely used instead to train and run such networks. This paper is a tutorial for students and researchers on some of the techniques that can be used to reduce this computational cost considerably on modern x86 CPUs. We emphasize data layout, batching of the computation, the use of SSE2 instructions, and particularly leverage SSSE3 and SSE4 fixed-point instructions which provide a 3x improvement over an optimized floating-point baseline. We use speech recognition as an example task, and show that a real-time hybrid hidden Markov model / neural network (HMM/NN) large vocabulary system can be built with a 10x speedup over an unoptimized baseline and a 4x speedup over an aggressively optimized floating-point baseline at no cost in accuracy. The techniques described extend readily to neural network training and provide an effective alternative to the use of specialized hardware.

27 Deep learning with low precision works. Paper: "LOW PRECISION ARITHMETIC FOR DEEP LEARNING". Matthieu Courbariaux & Jean-Pierre David (Department of Electrical Engineering, École Polytechnique de Montréal, Montréal, QC H3T 1J4, Canada), Yoshua Bengio (Department of Computer Science and Operations Research, Université de Montréal, Montréal, QC H3T 1J4, Canada). Abstract: We simulate the training of a set of state of the art neural networks, the Maxout networks (Goodfellow et al., 2013a), on three benchmark datasets: the MNIST, CIFAR10 and SVHN, with three distinct arithmetics: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those arithmetics, we assess the impact of the precision of the computations on the final error of the training. We find that very low precision computation is sufficient not just for running trained networks but also for training them. For example, almost state-of-the-art results were obtained on most datasets with around 10 bits for computing activations and gradients, and 12 bits for storing updated parameters.

28 Deep learning with low precision works. Paper: "Deep Learning with Limited Numerical Precision". Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan (IBM T. J. Watson Research Center, Yorktown Heights, NY), Pritish Narayanan (IBM Almaden Research Center, San Jose, CA). Abstract: Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
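
Stochastic rounding, which the abstract above credits for making 16-bit fixed-point training work, can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function name and the `frac_bits` parameter are hypothetical, and saturation to a fixed word length is omitted.

```python
import math
import random


def stochastic_round(x, frac_bits, rng=random):
    """Round x to a multiple of 2**-frac_bits.

    The value is rounded up with probability equal to its fractional
    remainder, so the result equals x in expectation.
    """
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    return lower / scale


rng = random.Random(0)  # fixed seed so the demo is repeatable
samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
nearest = round(0.3 * 4) / 4  # round-to-nearest on the same grid
```

Round-to-nearest always maps 0.3 to the same grid point, so small updates can be lost systematically. Stochastic rounding returns either neighbouring grid point, and its average over many draws stays close to 0.3.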

35 neon: nervana python deep learning library. User-friendly, extensible, abstracts parallelism. Support for many deep learning models. Interface to nervana cloud. Supports multiple backends: nervana engine, GPU cluster, CPU cluster (e.g. Cray XC30), Xeon Phi cluster (soon). Multiple limited precision options. Optimized for Maxwell at assembler level.

40 neon: easy model configuration. [Configuration listing not captured; annotated parts: dataset, weight initialization, learning rule, model layers and cost]

47 neon experiments in fp16/32. Use 16-bit floating point (fp16) as memory format. Multiply-and-adds use fp32. Kernel support for: GEMM; stochastic rounding; dropout / maxout; conv {f,b}prop, update; max pooling; statistics collection. Python element-wise operations auto-compiled into kernels. fp16 accumulations done carefully to minimize errors. Working with collaborators (Baidu, Bengio lab) to improve.
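
The split between fp16 storage and fp32 multiply-adds can be sketched with Python's `struct` module, which supports IEEE half precision through the `e` format. This is a simulation under stated assumptions, not the kernels' code: the helper names are hypothetical, and rounding a double-precision result to fp32 after every step only approximates a hardware fused multiply-add.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_fp16_accumulate(a, b):
    """Dot product with both products and running sum kept in fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_fp32_accumulate(a, b):
    """fp16 inputs, fp32 running sum, result stored back as fp16."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp32(acc + x * y)
    return to_fp16(acc)


a = [to_fp16(0.01)] * 4096   # values as they would sit in fp16 memory
b = [to_fp16(1.0)] * 4096
exact = sum(x * y for x, y in zip(a, b))
naive = dot_fp16_accumulate(a, b)
mixed = dot_fp32_accumulate(a, b)
```

In this example the fp16 running sum stops growing once the spacing between representable values exceeds the size of each addend, while the fp32 running sum stays close to `exact` and is rounded to fp16 only once, at the final store.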

50 fp16/32 accuracy. No accuracy loss going from fp32 to fp16. [Figure: error (%) distribution over 25 runs; y-axis: count; series: fp32, fp16, fp16 sto]

51 Speed benchmarks [1]: fp16 vs others. [Figure: time per layer (ms) for 5 convolutional layers, forward pass and backward pass; series: neon fp16, neon, Cudaconvnet2, Torch7, cudnn*, cudanet] Lower times are better. Benchmarks on GTX980. *2nd, 3rd layer don't fit on a 4GB card. [1] Soumith Chintala, github.com/soumith/convnet-benchmarks

54 Speed benchmarks [1]: fp16 vs fp. [Figure: time per layer (ms) for 5 convolutional layers, forward pass and backward pass; series: neon fp16, neon, Cuda-convnet2, cudanet, Torch7, cudnn*] Lower times are better. Benchmarks on GTX980. *some layers do not fit on a 4GB card. [1] Soumith Chintala, github.com/soumith/convnet-benchmarks

56 Benchmarks [1] show 2x performance: Alexnet. Raw numbers (averaged over 10 runs): Avg(10) fprop, bprop, and total, each in msecs and gflops (values not captured). Maximum practical peak is 4700 gflops. More than double speed [2] with half memory storage / bandwidth. [Figure labels: Time (s), Speed (TFLOPS); Alexnet Cuda-Convnet, Alexnet fp16] [1] Using conventions here: Soumith Chintala, github.com/soumith/convnet-benchmarks. [2] Numbers are relative to Titan Black (Kepler architecture).

57 Benchmarks [1] show 2x performance: Overfeat. Raw numbers (averaged over 10 runs): Avg(10) fprop, bprop, and total, each in msecs and gflops (values not captured).

58 Benchmarks [1] show 2x performance: VGG (N=64). Raw numbers (averaged over 10 runs): Avg(10) fprop, bprop, and total, each in msecs and gflops (values not captured).
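
The benchmark slides report each pass in both msecs and gflops. One way the two could be related is sketched below, assuming the common convention of two floating-point operations per multiply-add. The function names are hypothetical, and the slides do not state which flop-counting convention was actually used.

```python
PEAK_GFLOPS = 4700.0  # "maximum practical peak" quoted on the slides


def conv_fprop_flops(N, C, K, R, S, P, Q):
    """Flop count for one convolution forward pass (2 flops per multiply-add)."""
    return 2 * N * C * K * R * S * P * Q


def gflops(flops, msecs):
    """Achieved throughput in gflops for a measured time in milliseconds."""
    return flops / (msecs * 1e-3) / 1e9


def utilization(flops, msecs, peak=PEAK_GFLOPS):
    """Fraction of the quoted practical peak that was reached."""
    return gflops(flops, msecs) / peak
```

For example, `utilization(conv_fprop_flops(...), measured_msecs)` close to 1.0 would mean a layer runs near the quoted peak. The layer sizes and timings must come from an actual measurement, since the slide values were not captured.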

65 Summary. neon: user-friendly python library. maxas: powerful tool for optimizing deep learning. Fast performance, full utilization of GPU. Limited precision allows for larger models. Toolbox for exploring numerical representations.

66 GTC 2015. Contact us at [address not captured]. We are hiring: cloud engineers, GPU experts, machine learning engineers, software engineers. Sign up to try neon, our deep learning library. We can help solve your problem.

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer DEEP NEURAL NETWORKS AND GPUS Julie Bernauer GPU Computing GPU Computing Run Computations on GPUs x86 CUDA Framework to Program NVIDIA GPUs A simple sum of two vectors (arrays) in C void vector_add(int

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

A performance comparison of Deep Learning frameworks on KNL

A performance comparison of Deep Learning frameworks on KNL A performance comparison of Deep Learning frameworks on KNL R. Zanella, G. Fiameni, M. Rorro Middleware, Data Management - SCAI - CINECA IXPUG Bologna, March 5, 2018 Table of Contents 1. Problem description

More information

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang

Profiling the Performance of Binarized Neural Networks. Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang Profiling the Performance of Binarized Neural Networks Daniel Lerner, Jared Pierce, Blake Wetherton, Jialiang Zhang 1 Outline Project Significance Prior Work Research Objectives Hypotheses Testing Framework

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs Normalized execution time Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs Chao Li # Yi Yang* Min Feng* Srimat Chakradhar* Huiyang Zhou # # Department of Electrical and Computer

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA

DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA TOPICS COVERED Convolutional Networks Deep Learning Use Cases GPUs cudnn 2 MACHINE LEARNING! Training! Train the model from supervised

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL (stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

arxiv: v5 [cs.lg] 23 Sep 2015

arxiv: v5 [cs.lg] 23 Sep 2015 TRAINING DEEP NEURAL NETWORKS WITH LOW PRECISION MULTIPLICATIONS Matthieu Courbariaux & Jean-Pierre David École Polytechnique de Montréal {matthieu.courbariaux,jean-pierre.david}@polymtl.ca arxiv:1412.7024v5

More information

OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS

OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS April 4-7, 2016 Silicon Valley OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS Jeremy Appleyard, 7 April 2016 RECURRENT NEURAL NETWORKS Output is fed into input Perform the same operation repeatedly

More information

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1 GPUs and CUDA programming faster Performance CUDA OpenCL C/C++ GPU Coder MATLAB Python Ease of programming

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,

More information

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos

More information

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017 COMP9444 Neural Networks and Deep Learning 7. Image Processing COMP9444 17s2 Image Processing 1 Outline Image Datasets and Tasks Convolution in Detail AlexNet Weight Initialization Batch Normalization

More information

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers Deep Learning Workshop Nov. 20, 2015 Andrew Fishberg, Rowan Zellers Why deep learning? The ImageNet Challenge Goal: image classification with 1000 categories Top 5 error rate of 15%. Krizhevsky, Alex,

More information

Implementing Deep Learning for Video Analytics on Tegra X1.

Implementing Deep Learning for Video Analytics on Tegra X1. Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

Fast Algorithms for Convolutional Neural Networks

Fast Algorithms for Convolutional Neural Networks Fast Algorithms for Convolutional Neural Networks Andrew Lavin alavin@acm.org Scott Gray Nervana Systems sgray@nervanasys.com Abstract Deep convolutional neural networks take GPU-days of computation to

More information

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple

More information

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

NVIDIA FOR DEEP LEARNING. Bill Veenhuis NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA

More information

Half Precision Benchmarking for HPC

Half Precision Benchmarking for HPC PiotrLuszczek Half Precision Benchmarking for HPC S7676 May 11, 2017 GPU Technology Conference, San Jose, CA, USA 1 / 18 May 11, 2017 GPU Technology Conference, San Jose, CA, USA 2 / 18 Major Floating

More information

IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC.

IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC. IMPLEMENTING DEEP LEARNING USING CUDNN 이예하 VUNO INC. CONTENTS Deep Learning Review Implementation on GPU using cudnn Optimization Issues Introduction to VUNO-Net DEEP LEARNING REVIEW BRIEF HISTORY OF NEURAL

More information

GPU-Accelerated Deep Learning

GPU-Accelerated Deep Learning GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,

More information

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information

Accelerating cublas/cudnn using Input-Aware Auto-Tuning

Accelerating cublas/cudnn using Input-Aware Auto-Tuning Accelerating cublas/cudnn using Input-Aware Auto-Tuning The ISAAC library Philippe Tillet Harvard University Introduction cublas does not always achieve peak performance: (M, N, K) 1 = (4096, 4096, 4096):

More information

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor

Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Implementation of Deep Convolutional Neural Net on a Digital Signal Processor Elaina Chai December 12, 2014 1. Abstract In this paper I will discuss the feasibility of an implementation of an algorithm

More information

Scaling Deep Learning. Bryan

Scaling Deep Learning. Bryan Scaling Deep Learning @ctnzr What do we want AI to do? Guide us to content Keep us organized Help us find things Help us communicate 帮助我们沟通 Drive us to work Serve drinks? Image Q&A Baidu IDL Sample questions

More information

Keras: Handwritten Digit Recognition using MNIST Dataset

Keras: Handwritten Digit Recognition using MNIST Dataset Keras: Handwritten Digit Recognition using MNIST Dataset IIT PATNA January 31, 2018 1 / 30 OUTLINE 1 Keras: Introduction 2 Installing Keras 3 Keras: Building, Testing, Improving A Simple Network 2 / 30

More information

Effectively Scaling Deep Learning Frameworks

Effectively Scaling Deep Learning Frameworks Effectively Scaling Deep Learning Frameworks (To 40 GPUs and Beyond) Welcome everyone! I m excited to be here today and get the opportunity to present some of the work that we ve been doing at SVAIL, the

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

Deep Learning Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD.

Deep Learning 861.061 Basic Lecture - Complex Systems & Artificial Intelligence 2017/18 (VO) Asan Agibetov, PhD asan.agibetov@meduniwien.ac.at Medical University of Vienna Center for Medical Statistics,

Deep Learning with Torch

Deep Learning with Torch The good, the bad, the ugly since 2002 Jimmy Ba jimmy@psi.utoronto.ca What is Torch? Year 2012 Google Answer: Torch7 provides a Matlab-like environment for state-of-the-art machine

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

CNN optimization. Rassadin A

CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory

TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2017 Deep Neural

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

Deep Learning and Its Applications

Convolutional Neural Network and Its Application in Image Recognition Oct 28, 2016 Outline 1 A Motivating Example 2 The Convolutional Neural Network (CNN) Model 3 Training the CNN Model 4 Issues and Recent

Profiling GPU Code. Jeremy Appleyard, February 2016

Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation

ImageNet Classification with Deep Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky Ilya Sutskever Geoffrey Hinton University of Toronto Canada Paper with same name to appear in NIPS 2012 Main idea Architecture

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

MIXED PRECISION TRAINING OF NEURAL NETWORKS. Carl Case, Senior Architect, NVIDIA

MIXED PRECISION TRAINING OF NEURAL NETWORKS Carl Case, Senior Architect, NVIDIA OUTLINE 1. What is mixed precision training with FP16? 2. Considerations and methodology for mixed precision training 3.

Lecture 1: Introduction and Computational Thinking

PASI Summer School Advanced Algorithmic Techniques for GPUs Lecture 1: Introduction and Computational Thinking 1 Course Objective To master the most commonly used algorithm techniques and computational

Deep Learning Accelerators

Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

Mocha.jl. Deep Learning in Julia. Chiyuan Zhang CSAIL, MIT

Mocha.jl Deep Learning in Julia Chiyuan Zhang (@pluskid) CSAIL, MIT Deep Learning Learning with multi-layer (3~30) neural networks, on a huge training set. State-of-the-art on many AI tasks Computer Vision:

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cuDNN TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning

Training Deep Neural Networks (in parallel)

Lecture 9: Training Deep Neural Networks (in parallel) Visual Computing Systems How would you describe this professor? Easy? Mean? Boring? Nerdy? Professor classification task Classifies professors as

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability Keren Zhou, Guangming Tan, Xiuxia Zhang, Chaowei Wang, Ninghui Sun Institute of Computing Technology, Chinese Academy of

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design

vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design Minsoo Rhu Natalia Gimelshein Jason Clemons Arslan Zulfiqar

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism

Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism Jiecao Yu 1, Andrew Lukefahr 1, David Palframan 2, Ganesh Dasika 2, Reetuparna Das 1, Scott Mahlke 1 1 University of Michigan 2 ARM

Hardware and Software. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 6-1

Lecture 6: Hardware and Software Lecture 6-1 Administrative Assignment 1 was due yesterday. Assignment 2 is out, due Wed May 1. Project proposal due Wed April 24. Project-only office hours leading up to

DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM

DEEP LEARNING AND DIGITS DEEP LEARNING GPU TRAINING SYSTEM AGENDA 1 Introduction to Deep Learning 2 What is DIGITS 3 How to use DIGITS Practical DEEP LEARNING Examples Image Classification, Object Detection,

SPARSE PERSISTENT RNN. Feiwen Zhu, 5/9/2017

SPARSE PERSISTENT RNN Feiwen Zhu, 5/9/2017 Motivation Introduction Algorithm AGENDA Naïve Implementation Optimizations Experiments Conclusion 2 MOTIVATION Exploit sparsity for faster, larger networks Recurrent

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper

Deep Convolutional Neural Networks Nov. 20th, 2015 Bruce Draper Background: Fully-connected single layer neural networks Feed-forward classification Trained through back-propagation Example Computer Vision

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

Enabling the future of Artificial intelligence

Enabling the future of Artificial intelligence Contents AI Overview Intel Nervana AI products Hardware Software Intel Nervana Deep Learning Platform Learn more - Intel Nervana AI Academy Artificial Intelligence,

Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL

Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL 1 THIS TALK Using mixed precision and Volta your networks can be: 1. 2-4x faster 2. half the

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index A Algorithmic noise tolerance (ANT), 93-94 Application specific instruction set processors (ASIPs), 115-116 Approximate computing application level, 95 circuits-levels, 93-94 DAS and DVAS, 107-110

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

ACCELERATED COMPUTING: THE PATH FORWARD. Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015

ACCELERATED COMPUTING: THE PATH FORWARD Jen-Hsun Huang, Co-Founder and CEO, NVIDIA SC15 Nov. 16, 2015 COMMODITY DISRUPTS CUSTOM SOURCE: Top500 ACCELERATED COMPUTING: THE PATH FORWARD It's time to start

Deep Learning for Computer Vision II

IIIT Hyderabad Deep Learning for Computer Vision II C. V. Jawahar Paradigm Shift Feature Extraction (SIFT, HoG, ) Part Models / Encoding Classifier Sparrow Feature Learning Classifier Sparrow L 1 L 2 L

SVM multiclass classification in 10 steps 17/32

SVM multiclass classification in 10 steps import numpy as np # load digits dataset from sklearn import datasets digits = datasets.load_digits() # define training set size n_samples = len(digits.images

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

CME 213 SPRING Eric Darve

CME 213 SPRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

Convolutional Neural Networks

NPFL114, Lecture 4 Convolutional Neural Networks Milan Straka March 25, 2019 Charles University in Prague Faculty of Mathematics and Physics Institute of Formal and Applied Linguistics unless otherwise

Deep Learning for Computer Vision with MATLAB By Jon Cherrie

Deep Learning for Computer Vision with MATLAB By Jon Cherrie 2015 The MathWorks, Inc. 1 Deep learning is getting a lot of attention "Dahl and his colleagues won $22,000 with a deep-learning system. 'We

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

Machine Learning. MGS Lecture 3: Deep Learning

Machine Learning MGS Lecture 3: Deep Learning Dr Michel F. Valstar http://cs.nott.ac.uk/~mfv/ WHAT IS DEEP LEARNING? Shallow network: Only one hidden layer

Machine Learning on VMware vSphere with NVIDIA GPUs

Machine Learning on VMware vSphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology

Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK

Object recognition and computer vision using MATLAB and NVIDIA Deep Learning SDK 17 May 2016, Melbourne 24 May 2016, Sydney Werner Scholz, CTO and Head of R&D, XENON Systems Mike Wang, Solutions Architect,

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE THE RISE OF GPU COMPUTING A NEW ERA OF 3 NVIDIA IS THE WORLD'S A NEW ERA

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Deploying Deep Learning Networks to Embedded GPUs and CPUs Rishu Gupta, PhD Senior Application Engineer, Computer Vision 2015 The MathWorks, Inc. 1 MATLAB Deep Learning Framework Access Data Design + Train

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques

BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and

Code Mania Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python:

Code Mania 2019 Artificial Intelligence: a. Module - 1: Introduction to Artificial intelligence and Python: 1. Introduction to Artificial Intelligence 2. Introduction to python programming and Environment

Deep Neural Networks:

Deep Neural Networks: Part II Convolutional Neural Network (CNN) Yuan-Kai Wang, 2016 Web site of this course: http://pattern-recognition.weebly.com source: CNN for ImageClassification, by S. Lazebnik,

Parallel Deep Network Training

Lecture 26: Parallel Deep Network Training Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2016 Tunes Speech Debelle Finish This Album (Speech Therapy) Eat your veggies and study

Deep Learning Frameworks with Spark and GPUs

Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,

HPC with Multicore and GPUs

HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

GPUS FOR NGVLA. M Clark, April 2015

GPUS FOR NGVLA M Clark, April 2015 GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS MACHINES PC DATA CENTER MOBILE The World Leader in Visual Computing 2 What is a GPU? Tesla K40

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

Scaling Deep Learning on Multiple In-Memory Processors

Scaling Deep Learning on Multiple In-Memory Processors Lifan Xu, Dong Ping Zhang, and Nuwan Jayasena AMD Research, Advanced Micro Devices, Inc. {lifan.xu, dongping.zhang, nuwan.jayasena}@amd.com ABSTRACT

Mathematical computations with GPUs

Master Educational Program Information technology in applications Mathematical computations with GPUs Introduction Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University How to.. Process terabytes

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford University

Deep Learning Benchmarks Mumtaz Vauhkonen, Quaizar Vohra, Saurabh Madaan Collaboration with Adam Coates, Stanford University Abstract: This project aims at creating a benchmark for Deep Learning (DL) algorithms