Research Faculty Summit Systems Fueling future disruptions

Size: px
Start display at page:

Download "Research Faculty Summit Systems Fueling future disruptions"

Transcription

1 Research Faculty Summit 2018 Systems Fueling future disruptions

2 Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia

3 System Challenge in Deep Learning Innovations are emerging very fast in deep learning area New DNN models and workload patterns RNN, CNN, GAN, reinforcement learning, graph neural network, etc. Diverse and emerging hardware accelerators, GPU, FPGA, ASICs, edge devices, NV-Link, RDMA, etc. Compiler stack is key to bridge framework and hardware Combine information of computation graph and hardware Optimize for both local execution and distributed scalability Critical for both training and inference Intermediate Representation (IR) x w * b + y Compiler & Optimizer Infrastructure Execution backends

4 Wolong: Optimizer Stack for Deep Learning System innovation to bridge application and hardware General computation graph optimization Software and hardware co-design Just-in-time compiler Transparent optimization Communication efficiency Accelerator execution efficiency Memory efficiency Tensor Placement Optimizer: Memory layout and placement optimization Intermediate Representation (graph of operators) Global Optimizer: Optimize distributed training over RDMA Graph analyzer/ RDMA memcpy library Local Optimizer: Optimize execution with JIT compiler Operator batching/ kernel fusion Execution Runtime CPU, GPU, RDMA devices

5 Global Optimizer Fast Distributed Deep Learning Computation over RDMA

6 Distributed Dataflow Graph Execution Deep learning computation is modeled as dataflow graph Achieve parallel manner through graph partitioning Model parallelism vs. data parallelism Tensor transmission across server becomes bottlenecks Dispatch partitions Partition YY graph HH * σσ * WW 11 XX WW 22 Latency (ms) Send-Recv Benchmark grpc RDMA 10x Send HH σσ WW 11 * XX Recv YY * WW 2 message size (MB) Server 0 Server 1

7 General Message Passing Library (e.g., RPC) Unavoidable memory copy overhead in RPC Generally designed for dynamic data structure Lacks knowledge of actual data placement and size Extra memory copy from data serialization Software/hardware co-design to completely remove memory copy overhead Leverage runtime application information RDMA network HH Send σσ * WW 11 XX Application memory t RPC managed buffer t t YY * Recv WW 22 Application memory t RPC managed buffer t Server 0 Server 1

8 Combine Dataflow Graph Computation with RDMA Tensor abstraction in deep learning computation Consists of a plain byte array with sufficiently large size (tens of KB to MB) Do NOT require variant data serialization/deserialization Do NOT require extra batching since access pattern is already sequential RDMA enables to manage local and distributed memory in a unified view One-side RDMA R/W : efficient memory copy between host memory GPU-Direct RDMA : efficient memory copy between host and device memory Global graph optimizer for distributed computation Has the entire view and control of memory placement among devices and servers Capable of making globally optimized strategy for tensor data placement in runtime

9 Optimized Communication Mechanism Transfer statically placed tensor through one-side RDMA write Phase I: graph analyzing RDMA-based zero-copy communication Phase II: graph execution Send HH σσ * WW 11 XX... Source Tensor 1 RDMA lib: Conduct remote memory copy One-sided RDMA write (Polling flag byte) Dest Tensor Recv YY * WW 22 Server 0 Tensor Manager: Detect the source tensor place Re-allocate as RDMA memory Tensor Manager: Pre-allocate RDMA compatible receive tensor Server 1

10 Global Optimizer: Performance Evaluation Improve training throughput, convergence speed and scalability Deep Learning Benchmarks Convergence of Seq2Seq Translation Throughput (mini-batchs/sec) x 2.3x 2.6x 4.2x 4.0x 8.1x AlexNet Inception FC LSTM GRU VGG16 Perplexity x speed up Run time (s) TensorFlow(gRPC) Wolong TensorFlow(gRPC) Wolong(RDMA) More details in our paper: RPC Considered Harmful: Fast Distributed Deep Learning on RDMA * Experiments are conducted on 8 servers 8 Nvidia GTX 1080 GPUs; The translation model uses WTM 15 datasets;

11 Local Optimizer Kernel Fusion for Deep Learning on GPU

12 Motivation Deep learning frameworks model computation as graph of primitive operators Expressivity to represent arbitrary neural network structure Flexibility to run on multi-device and multi-server through graph partitioning Significant framework overhead to schedule thousands of operators Kernel-launch overhead Cross operator communication overhead Too fine-grained to leverage vendor s library Example: 80-step LSTM model Contains 1686 operators in TensorFlow runtime (ms) LSTM 512x512 (80steps) TensorFlow CUDNN FuseKernel

13 DL Frameworks vs. Vendor Provided Library Flexibility Deep learning frameworks E.g., TensorFlow, PyTorch, CNTK Embrace flexibility and expressivity Performance inefficiency DL framework + Compiler Generate library-like code in runtime Win both of the worlds Hardware specific library E.g., cudnn, cublas, MKL Designed for extreme efficiency Impossible to handle customized or new network structure Efficiency

14 Wolong Compiler Design Computation graph level optimization Graph rewriting based on computational equivalence Common subexpression elimination, constant folding etc. Operator batching: automatically batch same type operators to better leverage batch efficiency Target and application specific runtime compilation Static shape and type inference Static memory planning Aggressive kernel fusion Computational Graph (IR) x * + y w b Graph level optimizer Target specific JIT compiler Execution runtime

15 Wolong Compiler Execution Workflow Detect optimize subgraph Code generation LSTM Relu Sigmoid Mul BiasAdd Shape inference Memory planning JIT compile FusedKernel Input graph Operator batching Cache hit in later iterations Rewrite graph Output FusedKernel Before Graph Execution Runtime Optimization Input Input

16 Graph Level Optimization: Operator Batching Automatically conduct GEMM fusion and static memory placement optimization M 0 [2,4] M 2 [4,1] M 1 [2,4] Concat(M 0, M 1 ) M 2 [4,1] Tensor M 0 (2,4) Tensor M 1 (2,4) O 0 [2,1] O 1 [2,1] Concat(O 0, O 1 ) Concat Tensor M 2 (4,4)

17 JIT Compilation: Kernel Fusion Leverage aggressive kernel fusion to completely remove scheduling overhead Element-wise (i.e., point-wise) operators No cross-element dependency between operators Better leverage cache, register locality h = ssssssssssssss(xx 1 + xx 2 ) h Sig Add xx 1 xx h xx 1 xx 2 global void kernel_0(float *x1, float *x2, float *h) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = x1[idx] + x2[idx]; float temp1 = sigmoidf(temp0); h[idx] = temp1; } }

18 JIT Compilation: Kernel Fusion Fuse arbitrary (non element-wise) operators in to single kernel Operator data dependency may introduce cross threads data dependency in kernel Need global synchronization to guarantee correctness Cross operator communication uses device memory E.g., fuse two matrix multiplications: Z = AA BB CC AA MM MM BB ZZ CC void kernel_0(float *A, float *B, float *C, float *Z) { if (idx < 1024) { buffer[idx] = _f(a, B); Global_Sync(); Z[idx] = _f(buffer, C); h[idx] = temp1; } }

19 Graph Computation in DL Frameworks Operators (kernels) are scheduled (launched) one by one CPU SM SM SM SM SM SM PCIe GPU block0 block1 block2 block3 block4 block5 Overhead of kernel launching, DAG scheduling, memory copy, etc. Conv block6 block7 block8 block0 block1 block2 block3 block4 block5 block6 block7 block8 block9 block1 0

20 Arbitrary Kernel Fusion Is Limited by GPU Architechture Hard to conduct global synchronization across all threads PCIe GPU CPU SM SM SM SM SM SM Conv barrier Fused as single kernel block0 Conv block6 Conv block1 Conv block7 Conv block2 Conv block8 Conv block3 Conv block9 Conv block4 Conv block10 Conv block5 Conv Never complete due to waiting for barrier Never be scheduled

21 Our Solution: Persistent Threads and Virtual Blocks Assign virtual block task to persistent threads PCIe GPU CPU SM SM SM SM SM SM block0 block1 block2 block3 block4 block5 Conv barrier vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 Fused as single kernel vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 vblock6 vblock7 vblock8 vblock9 vblock1 0

22 Kernel Packing Explore graph level parallelism in static code generation GPU PCIe CPU SM SM SM SM SM SM [64, 512] [512, 128] block0 block1 block2 block3 block4 block5 vblock0 vblock1 Under utilization of GPU resource σσ AAAAAA vblock0 vblock1 vblock2 vblock3 vblock0 vblock1 vblock2 vblock3 vblock4 vblock5 [64, 128] vblock6

23 Code Generation Code Generator Operator kernels: device functions A σσ Relu C B //kernel code generated by Wolong compiler global void kernel_0(float *A, float *B, float *C) { int idx = blockidx.x * blockdim.x + threadidx.x; if (idx < 1024) { float temp0 = sigmoidf(a[idx]); buffer0[idx] = temp0; } GlobalSync(); for (int tix = bx; tix < ey; tix += offx) { for (int tiy = by; tiy < ey; tiy += offy) { (buffer0, B, buffer1, 1024, 1024, 128); } } if (idx < 1024) { float temp2 = reluf(temp1); C[idx] = temp2; } } //device function for sigmoid operator device float sigmoidf(float in) { return 1.f / (1.f + expf(-in)); } //device function for relu operator device float reluf(float in) { return fmaxf(0.f, in); } Operator device functions //device function for operator device float (float *a, float *b, float *c, int m, int n, int k) { if (thread_ix < m && thread_iy < k) { float temp = 0.f; for (int i = 0; i < n; ++i) { temp += a[tix*n+i] * b[k*i+tiy]; } c[k * tix + tiy] = temp; } }, Add, Mul, Sub, Div, Relu, Sigmoid, Tanh, Split, Max, Min, Convolution, etc.

24 Performance of End-to-end Kernel Fusion RNN inference benchmark (LSTM-128uints-80steps) Fused 1686 operators into 1 kernel Wolong fused_kernel Avg Runtime(ms) LSTM Benchmark 10.9x TensorFlow XLA OpBatch Fusion Experiments are conducted on Nvidia GTX 1080 Ti GPUs

25 Conclusion A compiler infrastructure is critical for both cloud and edge AI Optimize for fast distributed training in cloud Optimize for efficient inference on accelerator devices System innovations to bridge applications and diverse hardware Common intermediate representation (IR) Co-design software and hardware for extreme efficiency Wolong prototype has demonstrated the initial improvements Up to 8x speedup on training workloads Up to 10x speedup on inference benchmark

26 Thank You! Systems Fueling future disruptions

27

28 Distributed Graph Optimizer of Wolong Transfer dynamically allocated tensor through RDMA write/read Phase I: graph analyzing Supports GPUDirect RDMA as well Phase II: graph execution Send HH σσ * WW 11 XX Tensor meta data 1... One-sided RDMA write One-sided RDMA read Tensor meta data 01 Allocate... Recv YY * WW 22 Source Tensor Dest Tensor Server 0 Server 1

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13

GPU FOR DEEP LEARNING. 周国峰 Wuhan University 2017/10/13 GPU FOR DEEP LEARNING chandlerz@nvidia.com 周国峰 Wuhan University 2017/10/13 Why Deep Learning Boost Today? Nvidia SDK for Deep Learning? Agenda CUDA 8.0 cudnn TensorRT (GIE) NCCL DIGITS 2 Why Deep Learning

More information

CafeGPI. Single-Sided Communication for Scalable Deep Learning

CafeGPI. Single-Sided Communication for Scalable Deep Learning CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Automatic Code Generation TVM Stack

Automatic Code Generation TVM Stack Automatic Code Generation TVM Stack CSE 599W Spring TVM stack is an active project by saml.cs.washington.edu and many partners in the open source community The Gap between Framework and Hardware Frameworks

More information

End to End Optimization Stack for Deep Learning

End to End Optimization Stack for Deep Learning End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington Collaborators University of Washington AWS AI Team

More information

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS

NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

NVIDIA FOR DEEP LEARNING. Bill Veenhuis

NVIDIA FOR DEEP LEARNING. Bill Veenhuis NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA

More information

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder

EFFICIENT INFERENCE WITH TENSORRT. Han Vanholder EFFICIENT INFERENCE WITH TENSORRT Han Vanholder AI INFERENCING IS EXPLODING 2 Trillion Messages Per Day On LinkedIn 500M Daily active users of iflytek 140 Billion Words Per Day Translated by Google 60

More information

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018

Relay: a high level differentiable IR. Jared Roesch TVMConf December 12th, 2018 Relay: a high level differentiable IR Jared Roesch TVMConf December 12th, 2018!1 This represents months of joint work with lots of great folks:!2 TVM Stack Optimization Relay High-Level Differentiable

More information

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017

Deep Learning on Modern Architectures. Keren Zhou 4/17/2017 Deep Learning on Modern Architectures Keren Zhou 4/17/2017 HPC Software Stack Application Algorithm Data Layout CPU GPU MIC Others HPC Software Stack Deep Learning Algorithm Data Layout CPU GPU MIC Others

More information

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos

More information

High Performance Computing

High Performance Computing High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason

More information

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation

Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Accelerating GPU inferencing with DirectML and DirectX 12 Shrinath Shanbhag Senior Software Engineer Microsoft Corporation Machine Learning Machine learning has become immensely popular over the last decade

More information

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018

Nvidia Jetson TX2 and its Software Toolset. João Fernandes 2017/2018 Nvidia Jetson TX2 and its Software Toolset João Fernandes 2017/2018 In this presentation Nvidia Jetson TX2: Hardware Nvidia Jetson TX2: Software Machine Learning: Neural Networks Convolutional Neural Networks

More information

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain

TensorFlow: A System for Learning-Scale Machine Learning. Google Brain TensorFlow: A System for Learning-Scale Machine Learning Google Brain The Problem Machine learning is everywhere This is in large part due to: 1. Invention of more sophisticated machine learning models

More information

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability Janis Keuper Itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern,

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

TVM Stack Overview. Tianqi Chen

TVM Stack Overview. Tianqi Chen TVM Stack Overview Tianqi Chen Beginning of TVM Story Acclerator Beginning of TVM Story Acclerator Beginning of TVM Story Beginning of TVM Story Acclerator // Pseudo-code for convolution program for the

More information

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

More information

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters

Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed

More information

High-Performance Data Loading and Augmentation for Deep Neural Network Training

High-Performance Data Loading and Augmentation for Deep Neural Network Training High-Performance Data Loading and Augmentation for Deep Neural Network Training Trevor Gale tgale@ece.neu.edu Steven Eliuk steven.eliuk@gmail.com Cameron Upright c.upright@samsung.com Roadmap 1. The General-Purpose

More information

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB

GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB GPU Coder: Automatic CUDA and TensorRT code generation from MATLAB Ram Kokku 2018 The MathWorks, Inc. 1 GPUs and CUDA programming faster Performance CUDA OpenCL C/C++ GPU Coder MATLAB Python Ease of programming

More information

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA

Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Embarquez votre Intelligence Artificielle (IA) sur CPU, GPU et FPGA Pierre Nowodzienski Engineer pierre.nowodzienski@mathworks.fr 2018 The MathWorks, Inc. 1 From Data to Business value Make decisions Get

More information

The OpenVX Computer Vision and Neural Network Inference

The OpenVX Computer Vision and Neural Network Inference The OpenVX Computer and Neural Network Inference Standard for Portable, Efficient Code Radhakrishna Giduthuri Editor, OpenVX Khronos Group radha.giduthuri@amd.com @RadhaGiduthuri Copyright 2018 Khronos

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS

OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS April 4-7, 2016 Silicon Valley OPTIMIZING PERFORMANCE OF RECURRENT NEURAL NETWORKS Jeremy Appleyard, 7 April 2016 RECURRENT NEURAL NETWORKS Output is fed into input Perform the same operation repeatedly

More information

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Nvidia TX2 using CUDA Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of

More information

Operator Vectorization Library A TensorFlow Plugin

Operator Vectorization Library A TensorFlow Plugin Operator Vectorization Library A TensorFlow Plugin Matthew Pickett, Karen Brems, Florian Raudies Hewlett Packard Labs HPE-2016-94 Keyword(s): Machine learning; GPU acceleration; TensorFlow Abstract: TensorFlow

More information

GPU-Accelerated Deep Learning

GPU-Accelerated Deep Learning GPU-Accelerated Deep Learning July 6 th, 2016. Greg Heinrich. Credits: Alison B. Lowndes, Julie Bernauer, Leo K. Tam. PRACTICAL DEEP LEARNING EXAMPLES Image Classification, Object Detection, Localization,

More information

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017

A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017 A NEW COMPUTING ERA JENSEN HUANG, FOUNDER & CEO GTC CHINA 2017 TWO FORCES DRIVING THE FUTURE OF COMPUTING 10 7 Transistors (thousands) 10 6 10 5 1.1X per year 10 4 10 3 10 2 1.5X per year Single-threaded

More information

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning TVM: An Automated End-to-End Optimizing Compiler for Deep Learning Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos

More information

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018

Speculations about Computer Architecture in Next Three Years. Jan. 20, 2018 Speculations about Computer Architecture in Next Three Years shuchang.zhou@gmail.com Jan. 20, 2018 About me https://zsc.github.io/ Source-to-source transformation Cache simulation Compiler Optimization

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

CS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4

CS/CoE 1541 Final exam (Fall 2017). This is the cumulative final exam given in the Fall of Question 1 (12 points): was on Chapter 4 CS/CoE 1541 Final exam (Fall 2017). Name: This is the cumulative final exam given in the Fall of 2017. Question 1 (12 points): was on Chapter 4 Question 2 (13 points): was on Chapter 4 For Exam 2, you

More information

Towards Scalable Machine Learning

Towards Scalable Machine Learning Towards Scalable Machine Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Fraunhofer Center Machnine Larning Outline I Introduction

More information

GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning

GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning GUNREAL: GPU-accelerated UNsupervised REinforcement and Auxiliary Learning Koichi Shirahata, Youri Coppens, Takuya Fukagai, Yasumoto Tomita, and Atsushi Ike FUJITSU LABORATORIES LTD. March 27, 2018 0 Deep

More information

Enabling the future of Artificial intelligence

Enabling the future of Artificial intelligence Enabling the future of Artificial intelligence Contents AI Overview Intel Nervana AI products Hardware Software Intel Nervana Deep Learning Platform Learn more - Intel Nervana AI Academy Artificial Intelligence,

More information

Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach

Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach Jiawen Liu*, Hengyu Zhao*, Matheus Almeida

More information

Scaling Neural Network Acceleration using Coarse-Grained Parallelism

Scaling Neural Network Acceleration using Coarse-Grained Parallelism Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)

More information

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction

More information

Scalable Distributed Training with Parameter Hub: a whirlwind tour

Scalable Distributed Training with Parameter Hub: a whirlwind tour Scalable Distributed Training with Parameter Hub: a whirlwind tour TVM Stack Optimization High-Level Differentiable IR Tensor Expression IR AutoTVM LLVM, CUDA, Metal VTA AutoVTA Edge FPGA Cloud FPGA ASIC

More information

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks

Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Layer-wise Performance Bottleneck Analysis of Deep Neural Networks Hengyu Zhao, Colin Weinshenker*, Mohamed Ibrahim*, Adwait Jog*, Jishen Zhao University of California, Santa Cruz, *The College of William

More information

Graph Streaming Processor

Graph Streaming Processor Graph Streaming Processor A Next-Generation Computing Architecture Val G. Cook Chief Software Architect Satyaki Koneru Chief Technology Officer Ke Yin Chief Scientist Dinakar Munagala Chief Executive Officer

More information

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures

Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Characterizing and Benchmarking Deep Learning Systems on Modern Data Center Architectures Talk at Bench 2018 by Xiaoyi Lu The Ohio State University E-mail: luxi@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi

More information

Profiling GPU Code. Jeremy Appleyard, February 2016

Profiling GPU Code. Jeremy Appleyard, February 2016 Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation

More information

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications

TESLA V100 PERFORMANCE GUIDE. Life Sciences Applications TESLA V100 PERFORMANCE GUIDE Life Sciences Applications NOVEMBER 2017 TESLA V100 PERFORMANCE GUIDE Modern high performance computing (HPC) data centers are key to solving some of the world s most important

More information

GPU Programming with Ateji PX June 8 th Ateji All rights reserved.

GPU Programming with Ateji PX June 8 th Ateji All rights reserved. GPU Programming with Ateji PX June 8 th 2010 Ateji All rights reserved. Goals Write once, run everywhere, even on a GPU Target heterogeneous architectures from Java GPU accelerators OpenCL standard Get

More information

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI

NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI NVIDIA DGX SYSTEMS PURPOSE-BUILT FOR AI Overview Unparalleled Value Product Portfolio Software Platform From Desk to Data Center to Cloud Summary AI researchers depend on computing performance to gain

More information

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Florian Tramèr (joint work with Dan Boneh) Intel, Santa Clara August 30 th 2018 Trusted execution of ML: 3 motivating

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

Unified Deep Learning with CPU, GPU, and FPGA Technologies

Unified Deep Learning with CPU, GPU, and FPGA Technologies Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine

More information

Intelligent Hybrid Flash Management

Intelligent Hybrid Flash Management Intelligent Hybrid Flash Management Jérôme Gaysse Senior Technology&Market Analyst jerome.gaysse@silinnov-consulting.com Flash Memory Summit 2018 Santa Clara, CA 1 Research context Analysis of system &

More information

Code Optimizations for High Performance GPU Computing

Code Optimizations for High Performance GPU Computing Code Optimizations for High Performance GPU Computing Yi Yang and Huiyang Zhou Department of Electrical and Computer Engineering North Carolina State University 1 Question to answer Given a task to accelerate

More information

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017

DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE. Dennis Lui August 2017 DEEP NEURAL NETWORKS CHANGING THE AUTONOMOUS VEHICLE LANDSCAPE Dennis Lui August 2017 THE RISE OF GPU COMPUTING APPLICATIONS 10 7 10 6 GPU-Computing perf 1.5X per year 1000X by 2025 ALGORITHMS 10 5 1.1X

More information

Programming in CUDA. Malik M Khan

Programming in CUDA. Malik M Khan Programming in CUDA October 21, 2010 Malik M Khan Outline Reminder of CUDA Architecture Execution Model - Brief mention of control flow Heterogeneous Memory Hierarchy - Locality through data placement

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional

More information

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi

OPTIMIZED GPU KERNELS FOR DEEP LEARNING. Amir Khosrowshahi OPTIMIZED GPU KERNELS FOR DEEP LEARNING Amir Khosrowshahi GTC 17 Mar 2015 Outline About nervana Optimizing deep learning at assembler level Limited precision for deep learning neon benchmarks 2 About nervana

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters

Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters Hao Zhang Zeyu Zheng, Shizhen Xu, Wei Dai, Qirong Ho, Xiaodan Liang, Zhiting Hu, Jianliang Wei, Pengtao Xie,

More information

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius

MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius MIXED PRECISION TRAINING: THEORY AND PRACTICE Paulius Micikevicius What is Mixed Precision Training? Reduced precision tensor math with FP32 accumulation, FP16 storage Successfully used to train a variety

More information

Module Memory and Data Locality

Module Memory and Data Locality GPU Teaching Kit Accelerated Computing Module 4.4 - Memory and Data Locality Tiled Matrix Multiplication Kernel Objective To learn to write a tiled matrix-multiplication kernel Loading and using tiles

More information

Lecture 6: Optimize for Hardware Backends. CSE599G1: Spring 2017

Lecture 6: Optimize for Hardware Backends. CSE599G1: Spring 2017 Lecture 6: Optimize for Hardware Backends CSE599G1: Spring 2017 Where are we High level Packages User API Programming API Gradient Calculation (Differentiation API) System Components Computational Graph

More information

Wu Zhiwen.

Wu Zhiwen. Wu Zhiwen zhiwen.wu@intel.com Agenda Background information OpenCV DNN module OpenCL acceleration Vulkan backend Sample 2 What is OpenCV? Open Source Compute Vision (OpenCV) library 2500+ Optimized algorithms

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

Binary Convolutional Neural Network on RRAM

Binary Convolutional Neural Network on RRAM Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua

More information

Pluto A Distributed Heterogeneous Deep Learning Framework. Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud

Pluto A Distributed Heterogeneous Deep Learning Framework. Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud Pluto A Distributed Heterogeneous Deep Learning Framework Jun Yang, Yan Chen Large Scale Learning, Alibaba Cloud Outline PAI(Platform of Artificial Intelligence) PAI Overview Deep Learning with PAI Pluto

More information

VTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018

VTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018 VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018 TVM Stack High-Level Differentiable IR Tensor Expression IR LLVM CUDA Metal TVM Stack High-Level Differentiable IR Tensor

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware

Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Slalom: Fast, Verifiable and Private Execution of Neural Networks in Trusted Hardware Florian Tramèr (joint work with Dan Boneh) Stanford security lunch June 13 th Trusted execution of ML: 3 motivating

More information

Implementing Deep Learning for Video Analytics on Tegra X1.

Implementing Deep Learning for Video Analytics on Tegra X1. Implementing Deep Learning for Video Analytics on Tegra X1 research@hertasecurity.com Index Who we are, what we do Video analytics pipeline Video decoding Facial detection and preprocessing DNN: learning

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer

S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer S8822 OPTIMIZING NMT WITH TENSORRT Micah Villmow Senior TensorRT Software Engineer 2 100 倍以上速く 本当に可能ですか? 2 DOUGLAS ADAMS BABEL FISH Neural Machine Translation Unit 3 4 OVER 100X FASTER, IS IT REALLY POSSIBLE?

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

Deep Learning Compiler

Deep Learning Compiler Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,

More information

Inference Engine compiler and SDK FWDNXT 2018

Inference Engine compiler and SDK FWDNXT 2018 Inference Engine compiler and SDK 1 Deep Learning processor Best performance per power Best utilization Efficient use of memory bandwidth Low latency Scalability: IoT to cloud 2 Deep Learning processor

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

CUDA Architecture & Programming Model

CUDA Architecture & Programming Model CUDA Architecture & Programming Model Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What s New

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Matchbox. automatic batching for imperative deep learning NVIDIA GTC, 2018/3/28. James Bradbury

Matchbox. automatic batching for imperative deep learning NVIDIA GTC, 2018/3/28. James Bradbury Matchbox automatic batching for imperative deep learning James Bradbury NVIDIA GTC, 2018/3/28 Roadmap Imperative deep learning How manual batching works Other tools for automatic batching The Matchbox

More information

AutoTVM & Device Fleet

AutoTVM & Device Fleet AutoTVM & Device Fleet ` Learning to Optimize Tensor Programs Frameworks High-level data flow graph and optimizations Hardware Learning to Optimize Tensor Programs Frameworks High-level data flow graph

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017

VOLTA: PROGRAMMABILITY AND PERFORMANCE. Jack Choquette NVIDIA Hot Chips 2017 VOLTA: PROGRAMMABILITY AND PERFORMANCE Jack Choquette NVIDIA Hot Chips 2017 1 TESLA V100 21B transistors 815 mm 2 80 SM 5120 CUDA Cores 640 Tensor Cores 16 GB HBM2 900 GB/s HBM2 300 GB/s NVLink *full GV100

More information