End to End Optimization Stack for Deep Learning

Size: px
Start display at page:

Download "End to End Optimization Stack for Deep Learning"

Transcription

1 End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington

2 Collaborators University of Washington AWS AI Team Tianqi Chen Thierry Moreau Haichen Shen Ziheng Jiang ML, Software Stack Hardware Stack GPU ARM, NNVM pipeline and many more contributors in the DMLC community Carlos Guestrin Luis Ceze Arvind Krishnamurthy

3 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

4 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

5 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

6 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware Built a new accelerator

7 Deep Learning System Research is Exciting but Hard Frameworks CNTK Need entire software stack on top of it! Computational graph Layout transformation Quantization Operator kernel optimization Benchmarking. Operator Libraries cudnn, NNPack, MKL-DNN Hardware Built a new accelerator

8 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

9 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

10 Deep Learning System Research is Exciting but Hard Frameworks CNTK Data Layout Optimization Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

11 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Operator Libraries cudnn, NNPack, MKL-DNN Hardware

12 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Operator Libraries cudnn, NNPack, MKL-DNN Need optimized hardware kernel Hardware for each variant, on each hardware!

13 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Data Layout Optimization Operator Fusion Serving Operator Libraries cudnn, NNPack, MKL-DNN Need optimized hardware kernel Hardware for each variant, on each hardware!

14 The End to End System Challenge Frameworks Hardware Back-Ends

15 The End to End System Challenge Frameworks Hardware Back-Ends

16 The End to End System Challenge Frameworks Hardware Back-Ends

17 The End to End System Challenge Frameworks Hardware Back-Ends

18 The End to End System Challenge Frameworks Hardware Back-Ends

19 The End to End System Challenge Frameworks Hardware Back-Ends

20 The End to End System Challenge Frameworks Hardware Back-Ends

21 The End to End System Challenge Frameworks Intermediate representation Hardware Back-Ends

22 Computational Graph IR and Remaining Gap Examples: NGraph, XLA, NNVM, DLVM Computational Graph Auto Differentiation Memory Plan Operator Fusion Backends

23 Computational Graph IR and Remaining Gap Computational Graph Auto Differentiation Memory Plan Operator Fusion Backends

24 Computational Graph IR and Remaining Gap Computational Graph Auto Differentiation Memory Plan Operator Fusion too many possible choices: precision, layout, fused pattern, device, threading Need a low level IR to express them explicitly Backends

25 TVM: Low Level IR Concise and compact description Explicit control on codegen Framework NNVM Graph Auto Differentiation Memory Plan Ease of deployment Support new hardware backends TVM Hardware backends

26 Tensor Index Expression Declaration Compute C = dot(a, B.T) import tvm m, n, h = tvm.var('m'), tvm.var('n'), tvm.var('h') A = tvm.placeholder((m, h), name='a') Inputs B = tvm.placeholder((n, h), name= B') k = tvm.reduce_axis((0, h), name= k') C = tvm.compute((m, n), lambda i, j: tvm.sum(a[i, k] * B[j, k], axis=k)) Shape of C Computation Rule

27 Challenge: Hardware Diversities IR

28 Challenge: Hardware Diversities CPU GPU Accelerators IR

29 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed

30 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed Compute primitives scalar vector tensor

31 Challenge: Hardware Diversities CPU GPU Accelerators IR Memory subsystem L3 L2 FIFO SM SM Unified L2 L2 TX/L1 TX/L1 Buffer Acc L1D L1I L1D L1I RF RF RF RF implicitly managed mixed explicitly managed Compute primitives scalar vector tensor Data type fp32 fp16 int8

32 Unified Schedule Optimizations for Hardwares Algorithm described in IR Lowering Generated code (LLVM, CUDA, OpenCL )

33 Unified Schedule Optimizations for Hardwares Algorithm described in IR Scheduling Optimization Lowering Generated code (LLVM, CUDA, OpenCL )

34 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout Lowering Generated code (LLVM, CUDA, OpenCL )

35 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering Generated code (LLVM, CUDA, OpenCL )

36 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation Generated code (LLVM, CUDA, OpenCL )

37 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation ( ) Latency hiding Generated code (LLVM, CUDA, OpenCL )

38 Unified Schedule Optimizations for Hardwares Scheduling Optimizations Algorithm described in IR Scheduling Optimization ( ) Data layout ( ) Tiling Lowering ( ) Thread cooperation ( ) Latency hiding ( ) Tensorization Generated code (LLVM, CUDA, OpenCL )

39 Separation of Compilation and Deployment Compilation Stack TVM Runtimes Framework Frontends Deploy NNVM TVM TVM Graph Module Heavy optimizations Lightweight, 300 to 600 KB

40 Remote Execution and Profiling Devices with TVM Runtime Server with TVM Compiler TVM RPC

41 Performance Portable against state of art 8 6 K80, Baseline MXNet with cudnn auto tune enabled One grad student month 1.2x MXNet NNVM Compiler Raspberry Pi 3 Baseline: MXNet with OpenBLAS and NNPack Two undergrad weeks MXNet NNVM Compiler Time cost(ms) Time cost(ms) 4 1.2x x 11.5x 0 ResNet18 MobileNet 0 ResNet18 MobileNet Credit: Leyuan Wang(AWS/UCDavis), Yuwei Hu(TuSimple), Zheng Jiang(AWS/FDU)

42 Coming Soon: Target New Accelerators Tensorization Latency Hiding FPGA Example for building new hardware backend Open-source soon

43 NNVM Compiler: Open Compiler for AI Systems Caffe Keras MXNet PyTorch Caffe2 CNTK CoreML NNVM ONNX Graph Optimizations External Support Supported Work in progress TVM TVM Primitives Joint Work with AWS AI Team and DMLC community More hardware backends OpenCL LLVM Metal CUDA X86 AMDGPUs ARM Javascript/WASM

44 Deep Learning System Research is Exciting but Hard Frameworks CNTK Computational graph Operator Libraries cudnn, NNPack, MKL-DNN Hardware

45 Deep Learning System Research is Just Exciting Frameworks CNTK NNVM My new optimizations works on all platforms! Graph Optimizations TVM TVM Primitives I can program my new accelerators from python :) Hardware

46 Deep Learning System Research is Just Exciting Frameworks CNTK NNVM My new optimizations works on all platforms! Graph Optimizations TVM TVM Primitives I can program my new accelerators from python :) Hardware You can be part of it!

Automatic Code Generation TVM Stack

Automatic Code Generation TVM Stack Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks

More information

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos

More information

TVM Stack Overview. Tianqi Chen

TVM Stack Overview. Tianqi Chen TVM Stack Overview Tianqi Chen Beginning of TVM Story Acclerator Beginning of TVM Story Acclerator Beginning of TVM Story Beginning of TVM Story Acclerator // Pseudo-code for convolution program for the

More information

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018 Relay: a high level differentiable IR Jared Roesch TVMConf December 12th, 2018!1 This represents months of joint work with lots of great folks:!2 TVM Stack Optimization Relay High-Level Differentiable

More information

Optimizing CNN Inference on CPUs

Optimizing CNN Inference on CPUs Optimizing CNN Inference on CPUs Yizhi Liu, Yao Wang, Yida Wang With others in AWS AI Agenda Deep learning inference optimization Optimization on Intel CPUs Evaluation Make DL inference easier and faster

More information

AutoTVM & Device Fleet

AutoTVM & Device Fleet AutoTVM & Device Fleet ` Learning to Optimize Tensor Programs Frameworks High-level data flow graph and optimizations Hardware Learning to Optimize Tensor Programs Frameworks High-level data flow graph

More information

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen and Thierry Moreau, University of Washington; Ziheng Jiang, University of Washington, AWS; Lianmin Zheng, Shanghai Jiao Tong

More information

HKG OpenCL Support by NNVM & TVM. Jammy Zhou - Linaro

HKG OpenCL Support by NNVM & TVM. Jammy Zhou - Linaro HKG18-417 OpenCL Support by NNVM & TVM Jammy Zhou - Linaro Agenda OpenCL Overview OpenCL in NNVM & TVM Current Status OpenCL Introduction Open Computing Language Open standard maintained by Khronos with

More information

Lecture 6: Optimize for Hardware Backends. CSE599G1: Spring 2017

Lecture 6: Optimize for Hardware Backends. CSE599G1: Spring 2017 Lecture 6: Optimize for Hardware Backends CSE599G1: Spring 2017 Where are we High level Packages User API Programming API Gradient Calculation (Differentiation API) System Components Computational Graph

More information

VTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018

VTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018 VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018 TVM Stack High-Level Differentiable IR Tensor Expression IR LLVM CUDA Metal TVM Stack High-Level Differentiable IR Tensor

More information

arxiv: v2 [cs.lg] 20 May 2018

arxiv: v2 [cs.lg] 20 May 2018 TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen 1, Thierry Moreau 1, Ziheng Jiang 2, Lianmin Zheng 3, Eddie Yan 1 Meghan Cowan 1, Haichen Shen 1, Leyuan Wang 4, Yuwei Hu

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Scalable Distributed Training with Parameter Hub: a whirlwind tour Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC

More information

Wu Zhiwen.

Wu Zhiwen. Wu Zhiwen zhiwen.wu@intel.com Agenda Background information OpenCV DNN module OpenCL acceleration Vulkan backend Sample 2 What is OpenCV? Open Source Compute Vision (OpenCV) library 2500+ Optimized algorithms

More information

Deploying Deep Learning Networks to Embedded GPUs and CPUs

Deploying Deep Learning Networks to Embedded GPUs and CPUs Deploying Deep Learning Networks to Embedded GPUs and CPUs Rishu Gupta, PhD Senior Application Engineer, Computer Vision 2015 The MathWorks, Inc. 1 MATLAB Deep Learning Framework Access Data Design + Train

More information

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Pierre Nowodzienski Engineer pierre.nowodzienski@mathworks.fr 2018 The MathWorks, Inc. 1 From Data to Business value Make decisions Get

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

Deep Learning Compiler

Deep Learning Compiler Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Neural Network Exchange Format

Neural Network Exchange Format Copyright Khronos Group 2017 - Page 1 Neural Network Exchange Format Deploying Trained Networks to Inference Engines Viktor Gyenes, specification editor Copyright Khronos Group 2017 - Page 2 Outlook The

More information

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1 GPUs and CUDA programming faster Performance CUDA OpenCL C/C++ GPU Coder MATLAB Python Ease of programming

More information

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

NVIDIA FOR DEEP LEARNING. Bill Veenhuis NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA

More information

Snapdragon NPE Overview

Snapdragon NPE Overview March 2018 Linaro Connect Hong Kong Snapdragon NPE Overview Mark Charlebois Director, Engineering Qualcomm Technologies, Inc. Caffe2 Snapdragon Neural Processing Engine Efficient execution on Snapdragon

More information

Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018

Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018 Copyright Khronos Group 2018 - Page 1 Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc peter.mcguinness@gobrach.com May 2018 Khronos Mission E.g. OpenGL ES provides 3D

More information

arxiv: v1 [cs.lg] 11 Jul 2018

arxiv: v1 [cs.lg] 11 Jul 2018 VTA: An Open Hardware-Software Stack for Deep Learning Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy Paul G. Allen School of Computer Science & Engineering,

More information

S INSIDE NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORK CONTAINERS

S INSIDE NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORK CONTAINERS S8497 - INSIDE NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORK CONTAINERS Chris Lamb CUDA and NGC Engineering, NVIDIA John Barco NGC Product Management, NVIDIA NVIDIA GPU Cloud (NGC) overview AGENDA Using NGC

More information

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling AMD Radeon Open Compute Platform Felix Kuehling ROCM PLATFORM ON LINUX Compiler Front End AMDGPU Driver Enabled with ROCm GCN Assembly Device LLVM Compiler (GCN) LLVM Opt Passes GCN Target Host LLVM Compiler

More information

Standards for Vision Processing and Neural Networks

Standards for Vision Processing and Neural Networks Copyright Khronos Group 2017 - Page 1 Standards for Vision Processing and Neural Networks Radhakrishna Giduthuri, AMD radha.giduthuri@ieee.org Agenda Why we need a standard? Khronos NNEF Khronos OpenVX

More information

Deep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017

Deep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017 Deep Learning Frameworks COSC 7336: Advanced Natural Language Processing Fall 2017 Today s lecture Deep learning software overview TensorFlow Keras Practical Graphical Processing Unit (GPU) From graphical

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI Overview Unparalleled Value Product Portfolio Software Platform From Desk to Data Center to Cloud Summary AI researchers depend on computing performance to gain

More information

Managing Deep Learning Workflows

Managing Deep Learning Workflows Managing Deep Learning Workflows Deep Learning on AWS Batch treske@amazon.de September 2017 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Business Understanding Data Understanding

More information

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018 Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks

More information

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13 GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

Introduction to Deep Learning in Signal Processing & Communications with MATLAB Introduction to Deep Learning in Signal Processing & Communications with MATLAB Dr. Amod Anandkumar Pallavi Kar Application Engineering Group, Mathworks India 2019 The MathWorks, Inc. 1 Different Types

More information

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Toomas Remmelg Michel Steuwer Christophe Dubach The 4th South of England Regional Programming Language Seminar 27th

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018 Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization

More information

gpucc: An Open-Source GPGPU Compiler

gpucc: An Open-Source GPGPU Compiler gpucc: An Open-Source GPGPU Compiler Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt One-Slide Overview Motivation

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Jim Heaton Sr. FAE Deep Learning explores the study of algorithms that can learn from and make predictions on data Deep Learning is Re-defining Many Applications Cloud Acceleration

More information

Deep Learning: Transforming Engineering and Science The MathWorks, Inc.

Deep Learning: Transforming Engineering and Science The MathWorks, Inc. Deep Learning: Transforming Engineering and Science 1 2015 The MathWorks, Inc. DEEP LEARNING: TRANSFORMING ENGINEERING AND SCIENCE A THE NEW RISE ERA OF OF GPU COMPUTING 3 NVIDIA A IS NEW THE WORLD S ERA

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

Deep Learning Frameworks with Spark and GPUs

Deep Learning Frameworks with Spark and GPUs Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

The HammerBlade: An ML-Optimized Supercomputer for ML and Graphs

The HammerBlade: An ML-Optimized Supercomputer for ML and Graphs The HammerBlade: An ML-Optimized Supercomputer for ML and Graphs Prof. Michael B. Taylor (PI) University of Washington Prof. Adrian Sampson Cornell University Prof. Luis Ceze University of Washington Prof.

More information

How to Build Optimized ML Applications with Arm Software

How to Build Optimized ML Applications with Arm Software How to Build Optimized ML Applications with Arm Software Arm Technical Symposia 2018 ML Group Overview Today we will talk about applied machine learning (ML) on Arm. My aim for today is to show you just

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016

NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING. September 13, 2016 NVIDIA AI BRAIN OF SELF DRIVING AND HD MAPPING September 13, 2016 AI FOR AUTONOMOUS DRIVING MAPPING KALDI LOCALIZATION DRIVENET Training on DGX-1 NVIDIA DGX-1 NVIDIA DRIVE PX 2 Driving with DriveWorks

More information

DGX UPDATE. Customer Presentation Deck May 8, 2017

DGX UPDATE. Customer Presentation Deck May 8, 2017 DGX UPDATE Customer Presentation Deck May 8, 2017 NVIDIA DGX-1: The World s Fastest AI Supercomputer FASTEST PATH TO DEEP LEARNING EFFORTLESS PRODUCTIVITY REVOLUTIONARY AI PERFORMANCE Fully-integrated

More information

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

More information

gpucc: An Open-Source GPGPU Compiler

gpucc: An Open-Source GPGPU Compiler gpucc: An Open-Source GPGPU Compiler Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt One-Slide Overview Motivation

More information

Inside Broker How Broker Leverages the C++ Actor Framework (CAF)

Inside Broker How Broker Leverages the C++ Actor Framework (CAF) Inside Broker How Broker Leverages the C++ Actor Framework (CAF) Dominik Charousset inet RG, Department of Computer Science Hamburg University of Applied Sciences Bro4Pros, February 2017 1 What was Broker

More information

SIGGRAPH Briefing August 2014

SIGGRAPH Briefing August 2014 Copyright Khronos Group 2014 - Page 1 SIGGRAPH Briefing August 2014 Neil Trevett VP Mobile Ecosystem, NVIDIA President, Khronos Copyright Khronos Group 2014 - Page 2 Significant Khronos API Ecosystem Advances

More information

How to Build Optimized ML Applications with Arm Software

How to Build Optimized ML Applications with Arm Software How to Build Optimized ML Applications with Arm Software Arm Technical Symposia 2018 Arm K.K. Senior FAE Ryuji Tanaka Overview Today we will talk about applied machine learning (ML) on Arm. My aim for

More information

Supporting GPUs in Docker Containers on Apache Mesos

Supporting GPUs in Docker Containers on Apache Mesos Supporting GPUs in Docker Containers on Apache Mesos MesosCon Europe - 2016 Kevin Klues Senior Software Engineer Mesosphere Yubo Li Staff Researcher IBM Research China Kevin Klues Yubo Li Kevin Klues is

More information

Graph Streaming Processor

Graph Streaming Processor Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook Chief Software Architect Satyaki Koneru Chief Technology Officer Ke Yin Chief Scientist Dinakar Munagala Chief Executive Officer

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

The OpenVX Computer Vision and Neural Network Inference

The OpenVX Computer Vision and Neural Network Inference The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

LLVM for the future of Supercomputing

LLVM for the future of Supercomputing LLVM for the future of Supercomputing Hal Finkel hfinkel@anl.gov 2017-03-27 2017 European LLVM Developers' Meeting What is Supercomputing? Computing for large, tightly-coupled problems. Lots of computational

More information

Why data science is the new frontier in software development

Why data science is the new frontier in software development Why data science is the new frontier in software development And why every developer should care Jeff Prosise jeffpro@wintellect.com @jprosise Assertion #1 Being a programmer is like being the god of your

More information

Why AI Frameworks Need (not only) RDMA?

Why AI Frameworks Need (not only) RDMA? Why AI Frameworks Need (not only) RDMA? With Design and Implementation Experience of Networking Support on TensorFlow GDR, Apache MXNet, WeChat Amber, and Tencent Angel Bairen Yi (byi@connect.ust.hk) Jingrong

More information

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Advanced Topics on Heterogeneous System Architectures HSA foundation! Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

AFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library

AFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library AFOSR BRI: Codifying and Applying a Methodology for Manual Co-Design and Developing an Accelerated CFD Library Synergy@VT Collaborators: Paul Sathre, Sriram Chivukula, Kaixi Hou, Tom Scogland, Harold Trease,

More information

CS516 Programming Languages and Compilers II

CS516 Programming Languages and Compilers II CS516 Programming Languages and Compilers II Zheng Zhang Spring 2015 Jan 22 Overview and GPU Programming I Rutgers University CS516 Course Information Staff Instructor: zheng zhang (eddy.zhengzhang@cs.rutgers.edu)

More information

Lecture 3: Overview of Deep Learning System. CSE599W: Spring 2018

Lecture 3: Overview of Deep Learning System. CSE599W: Spring 2018 Lecture 3: Overview of Deep Learning System CSE599W: Spring 2018 The Deep Learning Systems Juggle We won t focus on a specific one, but will discuss the common and useful elements of these systems Typical

More information

CNN optimization. Rassadin A

CNN optimization. Rassadin A CNN optimization Rassadin A. 01.2017-02.2017 What to optimize? Training stage time consumption (CPU / GPU) Inference stage time consumption (CPU / GPU) Training stage memory consumption Inference stage

More information

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional

More information

Deep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor

Deep learning prevalence. first neuroscience department. Spiking Neuron Operant conditioning First 1 Billion transistor processor WELCOME TO Operant conditioning 1938 Spiking Neuron 1952 first neuroscience department 1964 Deep learning prevalence mid 2000s The Turing Machine 1936 Transistor 1947 First computer science department

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler

GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler GPGPU Offloading with OpenMP 4.5 In the IBM XL Compiler Taylor Lloyd Jose Nelson Amaral Ettore Tiotto University of Alberta University of Alberta IBM Canada 1 Why? 2 Supercomputer Power/Performance GPUs

More information

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017 DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X

More information

The Anatomy of Deep Learning Frameworks* *Everything you wanted to know about DL Frameworks but were afraid to ask

The Anatomy of Deep Learning Frameworks* *Everything you wanted to know about DL Frameworks but were afraid to ask The Anatomy of Deep Learning Frameworks* *Everything you wanted to know about DL Frameworks but were afraid to ask $whoami Master s Student in CS @ ETH Zurich, bachelors from BITS Pilani Contributor to

More information

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Tutorial Instructors [James Reinders, Michael J. Voss, Pablo Reble, Rafael Asenjo]

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Efficient Edge Computing for Deep Neural Networks and Beyond Vivienne Sze In collaboration with Yu-Hsin Chen, Joel Emer, Tien-Ju Yang, Sertac

More information

OpenCL: History & Future. November 20, 2017

OpenCL: History & Future. November 20, 2017 Mitglied der Helmholtz-Gemeinschaft OpenCL: History & Future November 20, 2017 OpenCL Portable Heterogeneous Computing 2 APIs and 2 kernel languages C Platform Layer API OpenCL C and C++ kernel language

More information

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA

EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA EXTENDING THE REACH OF PARALLEL COMPUTING WITH CUDA Mark Harris, NVIDIA @harrism #NVSC14 EXTENDING THE REACH OF CUDA 1 Machine Learning 2 Higher Performance 3 New Platforms 4 New Languages 2 GPUS: THE

More information

Khronos Connects Software to Silicon

Khronos Connects Software to Silicon Press Pre-Briefing GDC 2015 Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem All Materials Embargoed Until Tuesday 3 rd March, 12:01AM Pacific Time Copyright Khronos Group 2015 - Page

More information

From Application to Technology OpenCL Application Processors Chung-Ho Chen

From Application to Technology OpenCL Application Processors Chung-Ho Chen From Application to Technology OpenCL Application Processors Chung-Ho Chen Computer Architecture and System Laboratory (CASLab) Department of Electrical Engineering and Institute of Computer and Communication

More information

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu

Bigger GPUs and Bigger Nodes. Carl Pearson PhD Candidate, advised by Professor Wen-Mei Hwu Bigger GPUs and Bigger Nodes Carl Pearson (pearson@illinois.edu) PhD Candidate, advised by Professor Wen-Mei Hwu 1 Outline Experiences from working with domain experts to develop GPU codes on Blue Waters

More information

Open-source Tools For GPU Programming in Large Classrooms

Open-source Tools For GPU Programming in Large Classrooms rai-project.com Open-source Tools For GPU Programming in Large Classrooms Abdul Dakkak, Carl Pearson, Cheng Li {dakkak,pearson,cli99}@illinois.edu WebGPU Originally Designed for MOOC Around 100k students

More information

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos

More information

Onto Petaflops with Kubernetes

Onto Petaflops with Kubernetes Onto Petaflops with Kubernetes Vishnu Kannan Google Inc. vishh@google.com Key Takeaways Kubernetes can manage hardware accelerators at Scale Kubernetes provides a playground for ML ML journey with Kubernetes

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

Voice, Image, Video : AI in action with AWS. 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Voice, Image, Video : AI in action with AWS. 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Voice, Image, Video : AI in action with AWS A long heritage of machine learning at Amazon Personalized recommendations Fulfillment automation and inventory management Drones Voice driven interactions Inventing

More information

DECISION SUPPORT SYSTEM USING TENSOR FLOW

DECISION SUPPORT SYSTEM USING TENSOR FLOW Volume 118 No. 24 2018 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ DECISION SUPPORT SYSTEM USING TENSOR FLOW D.Anji Reddy 1, G.Narasimha 2, K.Srinivas

More information

Porting Fabric Engine to NVIDIA Unified Memory: A Case Study. Peter Zion Chief Architect Fabric Engine Inc.

Porting Fabric Engine to NVIDIA Unified Memory: A Case Study. Peter Zion Chief Architect Fabric Engine Inc. Porting Fabric Engine to NVIDIA Unified Memory: A Case Study Peter Zion Chief Architect Fabric Engine Inc. What is Fabric Engine? A high-performance platform for building 3D content creation applications,

More information

Beyond Hardware IP An overview of Arm development solutions

Beyond Hardware IP An overview of Arm development solutions Beyond Hardware IP An overview of Arm development solutions 2018 Arm Limited Arm Technical Symposia 2018 Advanced first design cost (US$ million) IC design complexity and cost aren t slowing down 542.2

More information

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Accelerating GPU inferencing with DirectML and DirectX 12 Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Machine Learning Machine learning has become immensely popular over the last decade

More information

Designing a Domain-specific Language to Simulate Particles. dan bailey

Designing a Domain-specific Language to Simulate Particles. dan bailey Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Hardware and Software. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 6-1

Hardware and Software. Fei-Fei Li & Justin Johnson & Serena Yeung. Lecture 6-1 Lecture 6: Hardware and Software Lecture 6-1 Administrative Assignment 1 was due yesterday. Assignment 2 is out, due Wed May 1. Project proposal due Wed April 24. Project-only office hours leading up to

More information

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL

Persistent RNNs. (stashing recurrent weights on-chip) Gregory Diamos. April 7, Baidu SVAIL (stashing recurrent weights on-chip) Baidu SVAIL April 7, 2016 SVAIL Think hard AI. Goal Develop hard AI technologies that impact 100 million users. Deep Learning at SVAIL 100 GFLOP/s 1 laptop 6 TFLOP/s

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information